web-archive
Common Crawl (CDX index)
A nonprofit project that crawls the open web at petabyte scale and publishes everything it collects, with roughly 2.2 billion pages per crawl and a fresh crawl released regularly. Each crawl has a free, keyless index API that reports whether and when a URL was captured, and the stored page content can then be pulled from the public dataset. It is a series of crawl snapshots rather than an on-demand archive, so it suits checking how a site looked in recent crawl windows or doing bulk web research.
Why it’s useful & how it works
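No API key is needed, requests from datacenter IPs are accepted, and both the index lookup and the content fetch are fast. Read the latest collection id from collinfo.json at run time rather than hardcoding it; a previously hardcoded 2025-43 had already gone stale. The index points to the stored copy from each crawl; it is not a replay service for arbitrary historical dates.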
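A minimal sketch of resolving the newest collection at run time instead of hardcoding it. It assumes collinfo.json is a JSON list of objects carrying an "id" field of the form CC-MAIN-YYYY-WW and a "cdx-api" field; those field names, and the helper names latest_collection and fetch_collections, are illustrative assumptions rather than anything stated above.

```python
import json
import re
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"

# Assumed id shape: CC-MAIN-<year>-<two-digit week>. Ids in any other
# shape are skipped rather than guessed at.
_ID_PATTERN = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def latest_collection(collections):
    """Return the entry whose CC-MAIN-YYYY-WW id is the most recent."""
    dated = []
    for entry in collections:
        match = _ID_PATTERN.match(entry.get("id", ""))
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            dated.append((key, entry))
    if not dated:
        raise ValueError("no CC-MAIN-YYYY-WW collection ids found")
    # Compare on (year, week) only, so the result does not depend on
    # the order in which the server lists collections.
    return max(dated, key=lambda pair: pair[0])[1]


def fetch_collections(url=COLLINFO_URL, timeout=30):
    """Download and parse the collection list (performs a network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A caller would typically run latest_collection(fetch_collections()) once at startup and reuse the returned entry's "cdx-api" value for lookups.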
What’s inside
~2.2B pages/crawl; latest CC-MAIN-2026-17 (Apr 2026). Also mirrored to Hugging Face.
API access
Index query: https://index.commoncrawl.org/CC-MAIN-2026-17-index?url=<url>&output=json
Collection list: https://index.commoncrawl.org/collinfo.json
WARC files: s3://commoncrawl/ and https://data.commoncrawl.org/
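The two-step flow (look up a capture in the index, then pull the stored record) might be sketched as follows. This assumes the index returns one JSON object per line when output=json is set, and that each object carries "filename", "offset" and "length" fields locating a gzip-compressed record inside a WARC file; those field names and all helper names here are assumptions for illustration.

```python
import gzip
import json
import urllib.parse
import urllib.request

DATA_BASE = "https://data.commoncrawl.org/"


def build_query_url(cdx_api, page_url):
    """Build an index lookup URL for one page, requesting JSON output."""
    query = urllib.parse.urlencode({"url": page_url, "output": "json"})
    return f"{cdx_api}?{query}"


def parse_captures(body):
    """Parse a line-delimited JSON response into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def byte_range(capture):
    """HTTP Range header value covering exactly one stored record."""
    start = int(capture["offset"])
    end = start + int(capture["length"]) - 1  # Range ends are inclusive
    return f"bytes={start}-{end}"


def fetch_record(capture, timeout=60):
    """Fetch and decompress one record (performs a network call)."""
    request = urllib.request.Request(
        DATA_BASE + capture["filename"],
        headers={"Range": byte_range(capture)},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return gzip.decompress(response.read())
```

Requesting only the record's byte range avoids downloading the whole WARC file, which is the main reason to go through the index first.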
Access
Programmatic API access; the index API is free and requires no key.