Published on 2025-08-07T06:18:08Z

magpie-crawler

magpie-crawler is an intelligence-gathering web crawler for Brandwatch, a leading social media monitoring and digital consumer intelligence company. Its purpose is to scan public web pages, blogs, and forums to find mentions of specific keywords, brands, or products that Brandwatch's clients are tracking. It is a well-behaved crawler that respects robots.txt directives.

What is magpie-crawler?

magpie-crawler is a web crawler operated by Brandwatch, a social media monitoring and intelligence company. It functions as a conventional web scraper that systematically indexes public content from blogs, forums, news sites, and social media platforms. The crawler identifies itself in server logs with a user-agent string like magpie-crawler/1.1. It is designed for efficient data collection and adheres to standard web protocols, including respecting robots.txt, making it a legitimate and transparent crawler.
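As a rough illustration of spotting this crawler in your own logs, the sketch below checks user-agent strings for the magpie-crawler token. The function names and sample strings are hypothetical, and only the magpie-crawler/1.1 form mentioned above is taken from this article. A user-agent string can be spoofed, so a match only shows what the client claims to be.

```python
import re

# Token the article says appears in the user-agent string, e.g. "magpie-crawler/1.1".
MAGPIE_TOKEN = re.compile(r"\bmagpie-crawler\b", re.IGNORECASE)


def is_magpie_user_agent(user_agent: str) -> bool:
    """Return True if the user-agent string names magpie-crawler."""
    return MAGPIE_TOKEN.search(user_agent) is not None


def count_magpie_requests(user_agents) -> int:
    """Count how many user-agent strings in an iterable name magpie-crawler."""
    return sum(1 for ua in user_agents if is_magpie_user_agent(ua))


# Hypothetical user-agent values, as they might be extracted from an access log.
sample_user_agents = [
    "magpie-crawler/1.1",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "MAGPIE-CRAWLER/1.1",
]
print(count_magpie_requests(sample_user_agents))  # 2
```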

Why is magpie-crawler crawling my site?

magpie-crawler is visiting your website to collect public information that may be relevant to Brandwatch's clients. It is likely that your site contains content related to brands, products, or industry keywords that are being monitored. The crawler is particularly interested in content that expresses opinions or reviews. The frequency of its visits is determined by the relevance of your content to the monitoring needs of Brandwatch's clients. This crawling is considered authorized as it only accesses publicly available content.

What is the purpose of magpie-crawler?

The purpose of magpie-crawler is to support Brandwatch's social listening and digital consumer intelligence platform. The data it collects helps organizations monitor online mentions of their brands, analyze market trends, and gain insights into consumer behavior. For website owners, there is no direct benefit from being crawled, as the service is designed to benefit Brandwatch's clients. However, the crawler is designed to be respectful of server resources and should not cause performance issues.

How do I block magpie-crawler?

To prevent magpie-crawler from accessing your website, you can add a disallow rule for it in your robots.txt file. This is the standard method for managing access for legitimate web crawlers.

Add the following lines to your robots.txt file to block magpie-crawler:

User-agent: magpie-crawler
Disallow: /
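To check that the rule behaves as intended before deploying it, the following minimal sketch parses the same two lines with Python's standard urllib.robotparser module. The helper name is_allowed, the example.com URL, and the SomeOtherBot agent are placeholders for illustration. The result reflects how Python's parser reads the rule, which is an approximation of how a compliant crawler would interpret it.

```python
from urllib.robotparser import RobotFileParser

# The rule from this article, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: magpie-crawler
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and SomeOtherBot are placeholders.
print(is_allowed(ROBOTS_TXT, "magpie-crawler/1.1", "https://example.com/blog/post"))  # False
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/blog/post"))        # True
```

The rule blocks only agents whose name matches magpie-crawler; other crawlers remain unaffected unless they have their own rules.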

How do I verify the authenticity of a user-agent that claims to be operated by Brandwatch?

Reverse IP lookup technique

To verify that a request really comes from the operator it claims, run the Linux host command twice, starting with the IP address of the requester.

```
> host IPAddressOfRequest
```
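This command performs a reverse DNS (PTR) lookup and returns the hostname registered for the IP address. For 8.8.4.4, for example, the output reads: 4.4.8.8.in-addr.arpa domain name pointer dns.google. Then run host a second time on the hostname it returned: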

```
> host ReverseDNSFromTheOutputOfFirstRequest
```

If the second lookup returns the original IP address and the hostname belongs to a domain associated with a trusted operator (e.g., Brandwatch), the user-agent can be considered legitimate.
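The two lookups above can also be scripted. The sketch below is one possible way to do it with Python's standard socket module, not an official Brandwatch procedure. The function names are hypothetical, and crawler.example.com is a placeholder suffix, because this article does not state which domain Brandwatch's crawler hosts resolve to. The lookup functions are passed as parameters so the logic can be exercised without network access.

```python
import socket


def forward_confirmed_hostname(ip,
                               reverse_lookup=socket.gethostbyaddr,
                               forward_lookup=socket.gethostbyname_ex):
    """Return the reverse-DNS hostname of ip if it resolves back to ip, else None."""
    try:
        # Step 1: reverse lookup (IP address to hostname).
        hostname, _aliases, _addresses = reverse_lookup(ip)
        # Step 2: forward lookup (hostname back to IP addresses).
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror: a lookup failed.
        return None
    return hostname if ip in addresses else None


def is_verified_crawler(ip, trusted_suffixes, **lookups):
    """Return True if ip passes both lookups and its hostname is in a trusted domain."""
    hostname = forward_confirmed_hostname(ip, **lookups)
    if hostname is None:
        return False
    hostname = hostname.rstrip(".").lower()
    return any(hostname == suffix or hostname.endswith("." + suffix)
               for suffix in trusted_suffixes)


# Example call (performs live DNS lookups; the suffix is a placeholder):
# is_verified_crawler("203.0.113.7", ["crawler.example.com"])
```

Note that socket.gethostbyname_ex resolves IPv4 addresses only, so this sketch would need a different forward lookup to handle IPv6 requesters.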

IP list lookup technique

Some operators provide a public list of IP addresses used by their crawlers. This list can be cross-referenced to verify a user-agent's authenticity. However, both operators and website owners may find it challenging to maintain an up-to-date list, so use this method with caution and in conjunction with other verification techniques.
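A cross-reference of this kind might look like the following sketch, which uses Python's standard ipaddress module. The ranges shown are reserved documentation addresses used purely as placeholders; they are not Brandwatch's addresses, and this article does not say whether Brandwatch publishes such a list. The function name is hypothetical.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks; NOT real crawler addresses.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.16/28"]


def ip_in_published_ranges(ip: str, ranges) -> bool:
    """Return True if ip falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(ip)
    networks = (ipaddress.ip_network(cidr) for cidr in ranges)
    return any(address in network for network in networks)


print(ip_in_published_ranges("192.0.2.44", PUBLISHED_RANGES))   # True
print(ip_in_published_ranges("203.0.113.5", PUBLISHED_RANGES))  # False
```

Because published lists can go stale, a match or a miss here is best treated as one signal alongside the reverse IP lookup.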