Published on 2025-08-07T06:18:08Z
Scrapy bot
Scrapy is not a specific bot but a popular, open-source web crawling framework for the Python language. It is used by developers to build their own custom web crawlers (or 'spiders') to extract structured data from websites. Its presence in your logs means an individual or organization has built a scraper with this framework to target your site's content, often for data mining, price monitoring, or research.
What is the Scrapy bot?
Scrapy is an open-source web crawling framework, not a specific bot from a single company. It provides developers with the tools to build their own custom web crawlers, called 'spiders,' for extracting data from websites. When a Scrapy-based bot visits a site, it may identify itself with a default user-agent string like Scrapy/2.8.0 (+https://scrapy.org), though developers often customize this. Because it is a framework, the behavior of a Scrapy-based crawler is entirely determined by the person who programmed it.
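Because the default user-agent follows the predictable Scrapy/<version> (+https://scrapy.org) pattern, an unmodified installation can be spotted in access logs with a short, stdlib-only Python sketch (the regex below is an assumption based on that format and will not catch customized user-agents):

```python
import re

# Matches the default Scrapy user-agent, e.g. "Scrapy/2.8.0 (+https://scrapy.org)".
# NOTE: this only catches unmodified installs; customized user-agents will not match.
SCRAPY_UA = re.compile(r"Scrapy/\d+(?:\.\d+)*\s*\(\+https://scrapy\.org\)")

def is_default_scrapy_ua(user_agent: str) -> bool:
    """Return True if the string looks like an unmodified Scrapy default user-agent."""
    return SCRAPY_UA.search(user_agent) is not None

print(is_default_scrapy_ua("Scrapy/2.8.0 (+https://scrapy.org)"))  # True
print(is_default_scrapy_ua("Mozilla/5.0 (Windows NT 10.0)"))       # False
```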
Why is a Scrapy bot crawling my site?
A Scrapy bot is crawling your site because someone has developed a custom crawler with the Scrapy framework to extract specific information from your pages. The purpose could be anything from collecting product information for a price comparison service to gathering news articles for media monitoring. The frequency of visits is determined by the developer's programming. This crawling may be authorized if the developer is following ethical practices, but many Scrapy crawlers operate without the explicit permission of the website owner.
What is the purpose of a Scrapy bot?
Scrapy is a tool for building specialized data extraction solutions. The data collected by Scrapy-based crawlers is typically used for the private purposes of the operator, such as market research, price monitoring, or academic research. Unlike search engine bots that index content to make it discoverable, Scrapy crawlers usually extract specific data points for private use. While this can sometimes be indirectly beneficial if the data is used to include your site in a service that drives traffic back to you, it can also consume server resources with no return value.
How do I block a Scrapy bot?
You can attempt to block Scrapy-based bots by adding a disallow rule for the default user-agent to your robots.txt file. However, this is often ineffective, as developers using Scrapy frequently change the user-agent string to something more generic to avoid being blocked.
To block the default user-agent, add the following lines to your robots.txt file:
User-agent: Scrapy
Disallow: /
More advanced bot detection methods may be necessary to block custom Scrapy crawlers.
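Since robots.txt is purely advisory, enforcement has to happen server-side. One common approach is to reject requests whose user-agent contains known scraper tokens before serving content. A minimal Python sketch (the block-list and the helper name are illustrative, not part of any specific web server):

```python
# Illustrative block-list; real deployments maintain and tune their own.
BLOCKED_UA_SUBSTRINGS = ("scrapy", "python-requests")

def http_status_for(user_agent: str) -> int:
    """Return 403 for user-agents on the block-list, 200 otherwise.

    Matching is case-insensitive and substring-based, so it also catches
    customized strings such as 'MyBot (built with Scrapy)'.
    """
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_UA_SUBSTRINGS):
        return 403
    return 200

print(http_status_for("Scrapy/2.8.0 (+https://scrapy.org)"))  # 403
print(http_status_for("Mozilla/5.0 (Windows NT 10.0)"))       # 200
```

Substring matching on the user-agent is easy to evade, which is why the text above recommends more advanced detection (rate limiting, behavioral analysis) for determined crawlers.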
How do I verify the authenticity of a user-agent operated by Zyte, the company behind Scrapy?
Reverse IP lookup technique
Run the host Linux command twice. First, run it with the IP address of the requester:
> host IPAddressOfRequest
This returns the reverse lookup hostname (e.g., 4.4.8.8.in-addr.arpa.). Then run host again with that hostname:
> host ReverseDNSFromTheOutputOfFirstRequest
If the IP address returned by the second command matches the original requester IP, the reverse DNS record is consistent and the crawler's claimed identity can be trusted.
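The two host commands above can be automated with Python's stdlib socket module. This sketch accepts injectable lookup functions so the flow can be demonstrated offline; the hostnames and IPs in the demo are illustrative, not real crawler addresses:

```python
import socket

def verify_reverse_dns(ip, expected_suffix,
                       reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                       forward_lookup=socket.gethostbyname):
    """Two-step verification mirroring the `host` commands:
    1) reverse-resolve the IP to a hostname,
    2) check the hostname belongs to the expected domain,
    3) forward-resolve that hostname and confirm it maps back to the same IP.
    """
    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record at all
    if not hostname.rstrip(".").endswith(expected_suffix):
        return False  # hostname claims the wrong domain
    try:
        return forward_lookup(hostname) == ip
    except OSError:
        return False

# Offline demonstration with stubbed resolvers (real use would omit them):
fake_reverse = lambda ip: "crawler-1.example-bot.com"
fake_forward = lambda host: "203.0.113.7"
print(verify_reverse_dns("203.0.113.7", "example-bot.com",
                         reverse_lookup=fake_reverse,
                         forward_lookup=fake_forward))  # True
```

The forward-confirmation step (3) is what defeats spoofing: anyone can point a PTR record at a trusted domain, but only the domain's owner controls the forward record that must resolve back to the requester's IP.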