Published on 2025-08-07T06:18:08Z

Arquivo-web-crawler

Arquivo-web-crawler is the official web archiving bot for Arquivo.pt, Portugal's national web archive. Its mission is to capture and preserve websites, especially those with cultural or historical significance to Portugal, for long-term public and research access. Unlike search engine bots, its goal is to create a complete, historically accurate snapshot of a webpage, including all its assets.

What is Arquivo-web-crawler?

Arquivo-web-crawler is a preservation-focused web crawler operated by Arquivo.pt, Portugal's national web archive. Built on the open-source Heritrix platform, its goal is to capture and store complete snapshots of web pages for historical and research use. It identifies itself with the user-agent Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling). Unlike commercial search crawlers that prioritize text, this bot aims to retrieve all page assets (CSS, JavaScript, images) to ensure that the archived version can be rendered accurately as it appeared at a specific point in time.
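
If you want to check whether this bot has been visiting your site, scanning your server access logs for that user-agent string is enough. Below is a minimal Python sketch; the log path is a hypothetical placeholder and should point at your own server's access log.

# Count requests from Arquivo-web-crawler in a server access log.
# LOG_PATH is hypothetical; adjust it for your own server setup.
LOG_PATH = "/var/log/nginx/access.log"
UA_MARKER = "Arquivo-web-crawler"

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    hits = [line for line in log if UA_MARKER in line]

print(f"{len(hits)} requests from {UA_MARKER}")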

Why is Arquivo-web-crawler crawling my site?

The Arquivo-web-crawler is visiting your site as part of a national digital preservation initiative. Its presence in your logs means your website has been identified as having content worth archiving for historical or research value, with a particular focus on sites within the Portuguese web sphere or those with cultural and educational significance. Crawl frequency varies based on the site's perceived importance; high-traffic government sites might be crawled daily, while others could be archived monthly or quarterly. This is a legitimate archival effort, similar to the work of the Internet Archive.

What is the purpose of Arquivo-web-crawler?

The primary purpose of Arquivo-web-crawler is to support Arquivo.pt's mission to preserve the Portuguese web for future generations. It collects web content that might otherwise be lost as sites change or go offline, creating a permanent historical record. The archived data is stored in standardized Web ARChive (WARC) files and made publicly accessible through the Arquivo.pt interface. This allows researchers, historians, and the public to view websites as they appeared in the past. For website owners, this service provides a form of digital preservation, ensuring your content remains part of a historical record with long-term accessibility.
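
As an illustration of the WARC format, the sketch below lists the capture date and URL of each archived HTTP response in a WARC file. It assumes the third-party Python library warcio and a hypothetical local file name; it is not an Arquivo.pt tool.

# Requires the third-party warcio library: pip install warcio
from warcio.archiveiterator import ArchiveIterator

# "example.warc.gz" is a hypothetical file name used for illustration.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # archived HTTP responses
            url = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            print(date, url)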

How do I block Arquivo-web-crawler?

If you wish to prevent your website from being archived by Arquivo.pt, you can add a rule to your robots.txt file. This file provides instructions to web crawlers about which sections of your site they should not access.

To block Arquivo-web-crawler, add the following lines to your robots.txt file:

User-agent: Arquivo-web-crawler
Disallow: /
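
If you only want to keep certain sections out of the archive rather than blocking the crawler entirely, you can disallow specific paths instead. In the example below, /private/ is a placeholder for whichever directory you want excluded:

User-agent: Arquivo-web-crawler
Disallow: /private/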

How do I verify the authenticity of the Arquivo-web-crawler user-agent?

Reverse DNS lookup technique

To verify that a request genuinely comes from Arquivo.pt, run the Linux host command twice, starting with the IP address of the requester (a technique known as forward-confirmed reverse DNS):
  1. > host IPAddressOfRequest
    This returns the PTR hostname registered for the IP (for example, host 8.8.4.4 returns a pointer to dns.google).
  2. > host HostnameFromTheFirstLookup
    This resolves that hostname back to an IP address.
If the second lookup returns the original IP address and the hostname belongs to a domain associated with a trusted operator (e.g., Arquivo.pt), the user-agent can be considered legitimate.
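
This check can be automated. The sketch below implements forward-confirmed reverse DNS using Python's standard socket module; the .arquivo.pt domain suffix is an assumption you should confirm against Arquivo.pt's own documentation.

import socket

# Assumed operator domain suffix; verify against Arquivo.pt's documentation.
TRUSTED_SUFFIX = ".arquivo.pt"

def is_legitimate_crawler(ip: str, suffix: str = TRUSTED_SUFFIX) -> bool:
    # Step 1: reverse lookup (PTR record) for the requesting IP.
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False
    if not hostname.rstrip(".").lower().endswith(suffix):
        return False  # hostname is not under the trusted operator's domain
    # Step 2: forward lookup of that hostname must return the original IP.
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return ip in addresses

print(is_legitimate_crawler("203.0.113.7"))  # documentation-range IP, prints False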

IP list lookup technique

Some operators provide a public list of IP addresses used by their crawlers. This list can be cross-referenced to verify a user-agent's authenticity. However, both operators and website owners may find it challenging to maintain an up-to-date list, so use this method with caution and in conjunction with other verification techniques.
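
When an operator does publish its ranges, checking an address against them is straightforward with Python's standard ipaddress module. The CIDR range below is a made-up placeholder from a reserved documentation block, not a real Arquivo.pt range:

import ipaddress

# Placeholder CIDR ranges; replace with the list the operator actually publishes.
PUBLISHED_RANGES = ["198.51.100.0/24"]

def ip_in_published_ranges(ip: str, ranges=PUBLISHED_RANGES) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)

print(ip_in_published_ranges("198.51.100.42"))  # True for the placeholder range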