Published on 2025-08-07T06:18:08Z

heritrix

Heritrix is the open-source, archival-quality web crawler used by the Internet Archive to create historical snapshots of websites for its famous Wayback Machine. Its purpose is digital preservation, not search indexing. It aims to capture a complete version of a site as it existed at a moment in time, providing a valuable resource for researchers, historians, and the general public, while also serving as a free, de facto backup for website owners.

What is heritrix?

Heritrix is the open-source web crawler developed and used by the Internet Archive. Its mission is to collect and preserve digital content for future generations. As the primary tool behind the Wayback Machine, Heritrix systematically browses the web to create comprehensive snapshots of websites. It identifies itself in server logs with a user-agent string like Mozilla/5.0 (compatible; heritrix/3.1.1 +http://archive.org). Unlike search engine crawlers, Heritrix is focused on completeness, aiming to capture not just text but also the CSS, JavaScript, and images needed for accurate historical rendering. The collected content is stored in Web ARChive (WARC) files for preservation.
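Because Heritrix announces itself with that user-agent string, a simple case-insensitive substring check is enough to spot its requests in an access log. The log line below is a fabricated example in Combined Log Format, shown only to illustrate the check:

```python
# Sketch: spotting Heritrix in an access log by its user-agent token.
# The log line is a fabricated example, not a real request.
sample_line = (
    '203.0.113.5 - - [07/Aug/2025:06:18:08 +0000] "GET / HTTP/1.1" 200 1234 '
    '"-" "Mozilla/5.0 (compatible; heritrix/3.1.1 +http://archive.org)"'
)

def is_heritrix(log_line: str) -> bool:
    """Case-insensitive check for the Heritrix user-agent token."""
    return "heritrix" in log_line.lower()

print(is_heritrix(sample_line))  # True
```

Note that a user-agent string alone proves nothing about who actually sent the request; the verification techniques further below cover that.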

Why is heritrix crawling my site?

Heritrix is crawling your website to preserve its content as a historical record. Its presence in your logs indicates that your site is being archived as part of the Internet Archive's mission to create a digital library of the internet. This is a non-commercial, preservation-focused activity. The crawler may visit periodically, with the frequency depending on your site's visibility and perceived cultural or historical significance. This is generally considered an authorized activity under fair use principles for archival purposes.

What is the purpose of heritrix?

The purpose of Heritrix is digital preservation in an era of ephemeral web content. It powers the Internet Archive's Wayback Machine, which has preserved billions of web pages. This service is invaluable for researchers, historians, and the public, providing access to historical web content that might otherwise be lost. For website owners, Heritrix provides the benefit of having your content preserved, ensuring it remains accessible even if your site goes offline. This can be particularly useful for documenting organizational history or maintaining access to important information.

How do I block heritrix?

While the archival work of the Internet Archive is widely seen as a public good, you can prevent Heritrix from crawling your site if you choose. You can add a disallow rule to your robots.txt file.

To block Heritrix, add the following lines to your robots.txt file:

User-agent: heritrix
Disallow: /
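If you want to confirm that a rule like this behaves as intended before deploying it, Python's standard urllib.robotparser module can evaluate the rules the same way a compliant crawler would. This is only a local sanity check of the rule text above:

```python
# Sketch: checking that the robots.txt rule above blocks a "heritrix"
# user-agent while leaving other crawlers unaffected.
from urllib.robotparser import RobotFileParser

rules = "User-agent: heritrix\nDisallow: /\n"
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("heritrix", "https://example.com/"))      # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/"))  # True
```

The URL and the other bot name are placeholders for illustration.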

How do I verify the authenticity of a user-agent claiming to be operated by the Internet Archive?

Reverse IP lookup technique

To verify a user-agent's authenticity, run the Linux host command twice, starting with the IP address of the requester.
  1. > host IPAddressOfRequest
    This reverse lookup returns the hostname from the IP address's PTR record (for example, host 8.8.4.4 returns dns.google).
  2. > host HostnameFromTheFirstLookup
If the forward lookup returns the original IP address and the hostname belongs to a domain associated with a trusted operator (e.g., the Internet Archive), the user-agent can be considered legitimate.
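The two-step check above can be automated with Python's standard socket module in place of the host command. The trusted-domain suffix used here (.archive.org) is an assumption for illustration; confirm the correct domain against the operator's own documentation:

```python
# Sketch of forward-confirmed reverse DNS: reverse-resolve the IP to a
# hostname, forward-resolve that hostname, and require both that the
# original IP comes back and that the hostname sits under a trusted domain.
# The ".archive.org" suffix is an assumption, not a confirmed value.
import socket

def verify_crawler_ip(ip: str, trusted_suffix: str = ".archive.org") -> bool:
    try:
        # Step 1: reverse (PTR) lookup of the requester's IP
        hostname, _, _ = socket.gethostbyaddr(ip)
        # Step 2: forward lookup of the returned hostname
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False  # no PTR record, or the hostname does not resolve
    return ip in forward_ips and hostname.endswith(trusted_suffix)
```

A spoofed user-agent fails this check because the attacker controls neither the PTR record for their IP nor the forward DNS of the trusted domain.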

IP list lookup technique

Some operators provide a public list of IP addresses used by their crawlers. This list can be cross-referenced to verify a user-agent's authenticity. However, both operators and website owners may find it challenging to maintain an up-to-date list, so use this method with caution and in conjunction with other verification techniques.
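Cross-referencing an IP against such a list can be sketched with Python's standard ipaddress module. The ranges below are placeholders drawn from the RFC 5737 documentation prefixes, not real Internet Archive addresses; substitute the operator's published list:

```python
# Sketch: checking whether a requester's IP falls inside any published
# crawler range. The ranges here are documentation placeholders only.
import ipaddress

published_ranges = ["192.0.2.0/24", "198.51.100.0/24"]

def ip_in_published_ranges(ip: str, ranges: list[str]) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(r) for r in ranges)

print(ip_in_published_ranges("192.0.2.17", published_ranges))   # True
print(ip_in_published_ranges("203.0.113.9", published_ranges))  # False
```

Because published lists can lag behind the operator's actual infrastructure, a miss here does not prove a request is illegitimate; combine this with the reverse DNS technique above.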