OpenCatalog, curated by FLOSSK
Archiving & digital preservation

Heritrix

Extensible, web-scale, archival-quality crawler produced by the Internet Archive for capturing sites into WARC files.

Why it is included

The classic open-source crawler behind large-scale web preservation programs; its WARC output also feeds many pywb-based replay pipelines.

Best for

Organizations running scoped crawls with politeness rules and WARC output requirements.
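To make "scoped crawls with politeness rules" concrete, here is a minimal Python sketch of the two checks a polite, scoped crawler applies before fetching a URL: is the host in scope, and does robots.txt allow it? This is an illustration only, not Heritrix's API (Heritrix expresses scope and politeness through its own job configuration). The host list, the `example-archiver` user agent, and the helper names are all hypothetical.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical crawl scope: only these hosts may be fetched.
ALLOWED_HOSTS = {"example.org", "www.example.org"}

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def in_scope(url: str) -> bool:
    """Return True if the URL is http(s) and its host is in the crawl scope."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.hostname in ALLOWED_HOSTS


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a standard-library RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(url: str, robots: RobotFileParser,
              user_agent: str = "example-archiver") -> bool:
    """A URL is fetchable only if it is in scope and robots.txt permits it."""
    return in_scope(url) and robots.can_fetch(user_agent, url)


def fetch_delay(robots: RobotFileParser,
                user_agent: str = "example-archiver",
                default: float = 1.0) -> float:
    """Seconds to wait between requests: the Crawl-delay value, else a default."""
    delay = robots.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

With `robots = build_robots(SAMPLE_ROBOTS)`, a call such as `may_fetch("https://example.org/private/x", robots)` is rejected by the `Disallow` rule, while an out-of-scope host is rejected before robots.txt is consulted at all. A production crawler layers much more on top of this (per-host queues, retries, WARC writing), which is the part a tool like Heritrix provides.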

Strengths

  • Mature crawl engine
  • WARC focus
  • Broad adoption

Limitations

  • Requires Java (JVM) tuning
  • Operators need training in the legal and robots.txt ethics of crawling

Good alternatives

Browsertrix · wget + warc-tools · commercial crawlers

Related tools