Published on 2025-08-07T06:18:08Z

PanguBot

PanguBot is a web crawler from the Chinese technology company Huawei. It functions as an AI data scraper, collecting public web content to be used as training data for Huawei's PanGu large language model (LLM). The bot is optimized to collect diverse training data, including multimedia content, and does not provide direct benefits, such as search traffic, to the websites it crawls.

What is PanguBot?

PanguBot is a web crawler operated by Huawei that serves as a data acquisition tool for its PanGu large language model (LLM), a multimodal AI system that can process text, images, and other data. As an AI data scraper, PanguBot's purpose is to download publicly available web content to be used for training Huawei's AI models. The bot identifies itself with the standardized user-agent string PanguBot. Its architecture is optimized for collecting diverse training data, and it shows a preference for multimedia content and sites with high information density.

Why is PanguBot crawling my site?

PanguBot is visiting your website to collect training data for Huawei's PanGu AI model. It looks for a wide range of content, with a particular interest in text, images, and technical documentation. The frequency of its visits depends on your site's content quality and update schedule. This crawling is part of Huawei's broader effort to improve its AI systems. Like many AI data scrapers, it treats any publicly accessible content as available to crawl unless the site opts out.

What is the purpose of PanguBot?

The purpose of PanguBot is to support Huawei's PanGu LLM by collecting diverse training data from the web. The content it gathers helps improve the AI model's capabilities in natural language processing and image recognition. For website owners, there is no direct benefit from the bot's crawling activity; its purpose is solely to improve Huawei's AI systems. This has led to concerns from some content creators about their work being used to train commercial AI systems without permission or compensation.

How do I block PanguBot?

To keep PanguBot from collecting your website's content for training Huawei's AI models, add a disallow rule for it to your robots.txt file. This is the standard method for managing crawler access.

Add the following lines to your robots.txt file to block PanguBot:

User-agent: PanguBot
Disallow: /
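
To confirm that the rule behaves as intended before deploying it, the robots.txt text can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name is_allowed and the example.com URL are illustrative assumptions, not part of any Huawei tooling.

```python
from urllib.robotparser import RobotFileParser

# The rule from the text above, parsed locally instead of fetched over HTTP.
ROBOTS_TXT = """\
User-agent: PanguBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder; substitute a URL from your own site.
    print(is_allowed(ROBOTS_TXT, "PanguBot", "https://example.com/page"))      # False
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))  # True
```

The check only shows how a compliant parser reads the file. It says nothing about whether a given crawler actually honors the rule.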

How do I verify that a request really comes from Huawei's PanguBot?

Reverse IP lookup technique

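To verify that a request really comes from the operator it claims, run the Linux host command twice, starting with the IP address of the request.

```
> host IPAddressOfRequest
```
This command performs a reverse DNS lookup and returns the hostname stored in the PTR record. For example, for 8.8.4.4 the output is "4.4.8.8.in-addr.arpa domain name pointer dns.google.", so the hostname is dns.google.

```
> host ReverseDNSFromTheOutputOfFirstRequest
```
This second command resolves that hostname back to its IP addresses. If the output includes the original IP address and the hostname belongs to a domain of a trusted operator (e.g., Huawei), the user-agent can be considered legitimate.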
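
The two lookups can also be automated. The sketch below uses Python's standard socket module and is only an illustration: the suffix in TRUSTED_SUFFIXES is a made-up placeholder, because the text does not state which domains Huawei's crawler hosts resolve to, and the function names are hypothetical.

```python
import socket
from typing import Callable, List, Sequence

# Hypothetical placeholder: replace with domain suffixes you have confirmed
# belong to the operator. The source does not list them.
TRUSTED_SUFFIXES = (".example-crawler.test",)


def reverse_lookup(ip: str) -> str:
    """First lookup: return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Second lookup: return every IP address the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def is_verified_crawler(
    ip: str,
    trusted_suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a requesting IP address."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or resolver failure
        return False
    # The hostname must sit under a trusted domain.
    if not any(hostname.endswith(suffix) for suffix in trusted_suffixes):
        return False
    try:
        # The hostname must resolve back to the original address.
        # Plain string comparison; IPv6 text forms may need normalizing.
        return ip in forward(hostname)
    except OSError:
        return False
```

The reverse and forward parameters exist so the logic can be exercised with stub resolvers, without live DNS. In production the defaults perform real lookups, so caching the result per IP address is advisable.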

IP list lookup technique

Some operators provide a public list of IP addresses used by their crawlers. This list can be cross-referenced to verify a user-agent's authenticity. However, both operators and website owners may find it challenging to maintain an up-to-date list, so use this method with caution and in conjunction with other verification techniques.
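
When an operator does publish a list, checking a request against it is a range-membership test. The sketch below uses Python's standard ipaddress module. The CIDR ranges are placeholders taken from the reserved documentation blocks, not real PanguBot ranges, and the function name is hypothetical.

```python
import ipaddress
from typing import Sequence

# Placeholder ranges from reserved documentation address space.
# They are NOT Huawei ranges; load the operator's actual list here.
PUBLISHED_RANGES = ("192.0.2.0/24", "198.51.100.0/25")


def ip_in_published_ranges(ip: str, ranges: Sequence[str] = PUBLISHED_RANGES) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:  # not a valid IPv4 or IPv6 address
        return False
    networks = (ipaddress.ip_network(cidr, strict=False) for cidr in ranges)
    return any(address in network for network in networks)


if __name__ == "__main__":
    print(ip_in_published_ranges("192.0.2.55"))   # True
    print(ip_in_published_ranges("203.0.113.9"))  # False
```

Because such lists go stale, a match here is best treated as one signal alongside the reverse DNS check rather than as proof on its own.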