web-archive

Common Crawl (CDX index)

A nonprofit project that crawls the open web at petabyte scale and publishes everything it collects, with roughly 2.2 billion pages per crawl and a fresh crawl released regularly. Each crawl has a free, keyless index API that reports whether and when a URL was captured, and the stored page content can then be pulled from the public dataset. It is a series of crawl snapshots rather than an on-demand archive, so it suits checking how a site looked in recent crawl windows or doing bulk web research.

Web pages / sites API

Why it’s useful & how it works

Keyless, datacenter-friendly, and fast both ways. Pull the latest collection id dynamically from collinfo.json rather than hardcoding it; the previously listed 2025-43 had already gone stale. It is a crawl index that points to stored copies, not an arbitrary historical replay service.
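As a rough sketch of the "don't hardcode" advice, the snippet below picks the newest collection from the parsed contents of collinfo.json. It assumes each entry carries an `id` of the form CC-MAIN-YYYY-WW and a `cdx-api` field holding the index endpoint; the helper names `latest_collection` and `fetch_collections` are hypothetical and not part of any Common Crawl library.

```python
import json
import re
import urllib.request

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"

# Assumption: current crawl ids look like CC-MAIN-YYYY-WW.
# Entries with any other id shape are skipped rather than guessed at.
_ID_PATTERN = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def latest_collection(collections):
    """Return the entry with the highest (year, week) id, or None if none match."""
    dated = []
    for entry in collections:
        match = _ID_PATTERN.match(entry.get("id", ""))
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            dated.append((key, entry))
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[0])[1]


def fetch_collections(url=COLLINFO_URL, timeout=30):
    """Download and parse the collection list (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A caller would typically run `latest_collection(fetch_collections())` once per session and read the endpoint from the returned entry, so that a new crawl is picked up without any code change.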

What’s inside

~2.2B pages/crawl; latest CC-MAIN-2026-17 (Apr 2026). Also mirrored to Hugging Face.

API access

Index query: https://index.commoncrawl.org/CC-MAIN-2026-17-index?url=<url>&output=json
Collection list: https://index.commoncrawl.org/collinfo.json
WARC data: s3://commoncrawl/ and https://data.commoncrawl.org/
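The endpoints above can be combined roughly as follows: build an index query, parse the response, then compute the byte range of one stored record. This is a minimal sketch that assumes the index returns one JSON object per line with `filename`, `offset`, and `length` fields, and that a single record can be read from https://data.commoncrawl.org/ with an HTTP Range request. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

DATA_PREFIX = "https://data.commoncrawl.org/"


def build_query_url(cdx_api, page_url):
    """Build an index lookup URL for one page, asking for JSON output."""
    return cdx_api + "?" + urlencode({"url": page_url, "output": "json"})


def parse_captures(body):
    """Parse a response body, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def warc_request(capture):
    """Return (url, headers) for fetching the single record behind a capture.

    Assumption: 'offset' and 'length' are byte counts into the WARC file
    named by 'filename'. HTTP ranges are inclusive, hence the -1.
    """
    offset = int(capture["offset"])
    length = int(capture["length"])
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return DATA_PREFIX + capture["filename"], headers
```

The functions only build and parse values; the HTTP calls themselves are left to the caller so that retries and rate limiting can be handled in one place.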

Access

Programmatic API access. No key is required.

Homepage

https://commoncrawl.org/