web-archive

Archive-It (Internet Archive)

The Internet Archive's subscription archiving service, used by more than a thousand libraries, governments, and universities to build curated web collections that together hold tens of billions of documents. Captures are viewed through replay links on wayback.archive-it.org, organized by collection. It pays off when a specific institution's collection covers your subject; coverage is collection-by-collection, so it is not the place to look up arbitrary URLs.

Web pages / sites

Search this archive

No programmatic check — opens the archive’s own search.

Why it’s useful & how it works

Finding: the aggregate /all/ CDX endpoint returns HTTP 403 (the known DDoS block), but per-collection CDX endpoints (e.g. /2950/timemap/cdx) return HTTP 200 both ways. Integrate per collection rather than through /all/. This needs a collection-ID strategy; the alternative is to accept replay links only, with no programmatic check, for arbitrary URLs.
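One possible shape for the collection-ID strategy is sketched below: probe a short list of candidate collections and keep the first one whose per-collection CDX answers 200 with a non-empty body. This is a sketch under stated assumptions, not a tested integration. The names `http_get` and `find_collection`, the injectable `fetch` parameter, the percent-encoding of the target, and the rule that an empty body means "no captures" are all assumptions; only the URL pattern and the 403/200 behaviour come from the finding above.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

BASE = "https://wayback.archive-it.org"


def http_get(url, timeout=10):
    """Return (status, body_text); HTTP errors are reported as a status, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # e.g. the 403 seen on the aggregate /all/ endpoint
        return err.code, ""


def find_collection(target, candidate_ids, fetch=http_get):
    """Return the first candidate collection that appears to hold the target, else None.

    Assumption: a 200 response with a non-empty body means captures exist.
    """
    for coll_id in candidate_ids:
        cdx = f"{BASE}/{coll_id}/timemap/cdx?url={quote(target, safe='')}"
        status, body = fetch(cdx)
        if status == 200 and body.strip():
            return coll_id
    return None
```

If `find_collection` returns `None`, the caller can fall back to offering a replay link without a programmatic check. Passing a stub as `fetch` keeps the selection logic testable without network access.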

What’s inside

Tens of billions of documents; petabyte-scale.

API access

Per-collection CDX: https://wayback.archive-it.org/<collId>/timemap/cdx?url= (works). Replay: https://wayback.archive-it.org/<collId>/<ts>/<url>. The aggregate /all/ CDX is blocked.
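The two URL patterns above can be assembled with a pair of small helpers, shown here as a minimal sketch. The function names are hypothetical, and percent-encoding the `url` query value is an assumption (the entry only shows `?url=`). The 14-digit timestamp in the usage note follows the usual Wayback convention and is likewise an assumption, not something this entry confirms.

```python
from urllib.parse import quote

BASE = "https://wayback.archive-it.org"


def cdx_url(coll_id, target):
    """Build the per-collection CDX query URL for a target URL."""
    # Assumption: the target is percent-encoded as a query value.
    return f"{BASE}/{coll_id}/timemap/cdx?url={quote(target, safe='')}"


def replay_url(coll_id, timestamp, target):
    """Build a replay link; the target URL is appended as-is."""
    return f"{BASE}/{coll_id}/{timestamp}/{target}"
```

For example, `cdx_url(2950, "example.com")` yields `https://wayback.archive-it.org/2950/timemap/cdx?url=example.com`, and `replay_url(2950, "20200101000000", "http://example.com/")` yields `https://wayback.archive-it.org/2950/20200101000000/http://example.com/`.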

Access

Programmatic API access (a key may be required — see the API tag).

Homepage

https://archive-it.org/