Pipeline Monitor

CC webgraph fetch + PageRank per TLD

TLD     Progress   Versions   Disk     Data   Status
.pl     100%       51/51      9.3GB           DONE
.cz     71%        36/51      7.5GB           RUNNING
.eu     0%         0/51       5.1GB           ERROR
.info   0%         0/51       7.4GB           ERROR
.de     6%         3/51                       RUNNING
.org    0%         0/51                       RUNNING
.it     0%         0/51                       RUNNING
.fr     0%         0/51                       RUNNING
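
The Progress column appears to be the share of fetched versions out of the total, rounded to a whole percent (for example 36 of 51 for .cz gives 71%). A minimal sketch of that arithmetic follows; the function name `progress_percent` is hypothetical and not taken from the pipeline's code.

```python
def progress_percent(done: int, total: int) -> int:
    """Return fetch progress as a whole percentage (assumed rounding rule)."""
    if total <= 0:
        # No versions known yet: report 0% instead of dividing by zero.
        return 0
    return round(100 * done / total)


if __name__ == "__main__":
    # Values taken from the Versions column of the status table.
    for tld, done in [(".pl", 51), (".cz", 36), (".de", 3), (".eu", 0)]:
        print(tld, f"{progress_percent(done, 51)}%")
```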

TLD done: 1
TLD running: 5
With results: 4
Pending: 0

Recent logs
[pipeline_sequential.log] Start: Sat Jun 27 08:48:17 PM CEST 2026
[pipeline_sequential.log] ======================================================
[pipeline_sequential.log] [20:48:17] START fetch .de
[pipeline_sequential.log] PID 2923690 → logs/fetch_de.log
[pipeline_sequential.log] Waiting for all TLDs to be fetched...
[va_scan.log] ✅ zxiaodaochu.com (1 crawl)
[va_scan.log] ✅ zxqxztuz.org (1 crawl)
[va_scan.log] ✅ zxzshangmao.com (1 crawl)
[va_scan.log] ✅ zz2959.com (1 crawl)
[va_scan.log] ✅ zztoptribute.dk (1 crawl)
[wayback_undiscovered.log] Outlinks: 0 unique domains
[wayback_undiscovered.log] [17/17] pillme.pl (3 CC links)
[wayback_undiscovered.log] CDX: 0 HTML snapshots
[wayback_undiscovered.log] Done. Processed: 17, with outlinks: 5
[wayback_undiscovered.log] Output: /var/www/lethe/data/result_wayback_outlinks.csv

Phase 1: fetch_webgraph_tld.py → S3 CC bucket → per-TLD filtering
Phase 2: pagerank.py → DuckDB ranks table
Phase 3: analyze_expired_tld.py --top 3000 --min-backlinks 5 → result_expired_*.csv
Pipeline: /var/www/lethe/run_tld_pipeline.sh de org it fr
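
Phases 2 and 3 can be illustrated with a small, self-contained sketch: a power-iteration PageRank over domain-to-domain edges, followed by a filter that keeps the highest-ranked domains with enough distinct backlinks. This is not the code of pagerank.py or analyze_expired_tld.py. The function names, the damping factor of 0.85, and the reading of --top and --min-backlinks as "keep the N best-ranked domains with at least K distinct linking domains" are all assumptions made for illustration, and the sketch uses plain dictionaries rather than DuckDB.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) domain pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by domains without outlinks is spread evenly over all nodes.
        dangling = sum(ranks[node] for node in nodes if not out_links.get(node))
        new_ranks = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src, targets in out_links.items():
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks


def top_candidates(edges, ranks, top=3000, min_backlinks=5):
    """Keep the `top` best-ranked domains with enough distinct backlinks.

    Assumed meaning of --top and --min-backlinks; the real script may differ.
    """
    backlinks = defaultdict(set)
    for src, dst in edges:
        if src != dst:
            backlinks[dst].add(src)
    eligible = [d for d in ranks if len(backlinks.get(d, ())) >= min_backlinks]
    eligible.sort(key=lambda d: ranks[d], reverse=True)
    return eligible[:top]


if __name__ == "__main__":
    # Toy graph with made-up domains; real input would come from Phase 1.
    toy_edges = [("a.de", "c.de"), ("b.de", "c.de"), ("c.de", "a.de")]
    toy_ranks = pagerank(toy_edges)
    print(top_candidates(toy_edges, toy_ranks, top=10, min_backlinks=2))
```

On real data the thresholds from the Phase 3 command line (3000 and 5) would replace the toy values used in the usage example.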