Published on 2025-08-07T06:18:08Z

Google-Extended

Google-Extended is not a web crawler itself, but a special control mechanism that allows website owners to manage whether their content can be used to train Google's generative AI models, such as Gemini and Vertex AI. By using a specific directive in a robots.txt file, publishers can opt out of having their content used for AI training purposes without affecting their site's visibility or ranking in Google Search results. It was introduced by Google to give creators more control over their content in the age of AI.

What is Google-Extended?

Google-Extended is a control mechanism, not a separate web crawler, introduced by Google to give website owners control over the use of their content for training Google's AI models. It functions as a special token that can be used in a robots.txt file. It works in conjunction with Google's existing crawlers, like Googlebot. While Googlebot still crawls the site for search indexing, the Google-Extended token in robots.txt tells Google whether that crawled content can also be used to train and improve AI systems like Google Gemini and the Vertex AI APIs.

Why is Google-Extended relevant to my site's crawling?

Google-Extended does not crawl your site independently. Instead, it is a rule that Google's regular crawlers (like Googlebot) look for when they visit. If you have not explicitly blocked Google-Extended in your robots.txt file, Google considers your content eligible for use in training its AI models. This process is part of Google's standard crawling operations. The frequency of these crawls is determined by your site's normal crawl budget, which is influenced by factors like your content update schedule and site authority.

What is the purpose of Google-Extended?

The primary purpose of Google-Extended is to act as a consent and control mechanism, giving publishers a clear choice about whether to contribute their content to the development of Google's AI ecosystem. It specifically governs the use of content for training Google Gemini and Vertex AI generative APIs. Google introduced this in response to publisher concerns about the use of their content for AI training. Importantly, opting out via Google-Extended is designed to have no negative impact on a site's inclusion or ranking in standard Google Search results.

How do I use Google-Extended to block AI training?

To prevent your website's content from being used to train Google's generative AI models, you need to add a specific rule to your robots.txt file. This will opt your site out of data collection for this purpose without affecting your Google Search ranking.

To block content use for Google's AI models, add the following lines to your robots.txt file:

User-agent: Google-Extended
Disallow: /
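If you want to confirm how such a rule is interpreted before deploying it, Python's standard-library robots.txt parser can be used as a quick sanity check. The snippet below is a minimal illustration: it parses the two-line rule above and shows that it blocks the Google-Extended token while leaving Googlebot (and thus Search indexing) untouched.

```python
from urllib import robotparser

# The robots.txt rule from above, as an in-memory string.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Google-Extended is disallowed everywhere...
print(parser.can_fetch("Google-Extended", "/"))    # False
# ...while Googlebot, which handles search indexing, is unaffected.
print(parser.can_fetch("Googlebot", "/any/page"))  # True
```

Because `robotparser` treats user-agent tokens with no matching group as allowed by default, this also demonstrates why blocking Google-Extended has no effect on regular crawling for search.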

How do I verify that a user-agent claiming to be from Google is authentic?

Reverse IP lookup technique

To verify a requester's authenticity, run the Linux host command twice, starting from the requester's IP address.
  1. > host IPAddressOfRequest
    This reverse DNS lookup returns a hostname (e.g., crawl-66-249-66-1.googlebot.com).
  2. > host HostnameFromTheOutputOfTheFirstCommand
If this forward lookup resolves back to the original IP address and the hostname belongs to a domain associated with a trusted operator (e.g., googlebot.com for Google), the user-agent can be considered legitimate.
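The two `host` commands above can also be automated. The sketch below performs the same double lookup with Python's `socket` module: reverse-resolve the IP to a hostname, check that the hostname is on a Google crawl domain, then forward-resolve the hostname and confirm it maps back to the original IP. The list of accepted domain suffixes is an assumption based on Google's documented crawler hostnames; adjust it for the operator you are verifying.

```python
import socket

def verify_google_crawler(ip: str) -> bool:
    """Double reverse-DNS check: IP -> hostname -> IP."""
    try:
        # Step 1: reverse DNS lookup — get the PTR hostname for the IP.
        hostname, _, _ = socket.gethostbyaddr(ip)
        # The hostname must end in a Google-operated crawl domain
        # (assumed suffixes; verify against Google's own documentation).
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward lookup — the hostname must resolve back
        # to the original requesting IP address.
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return ip in addresses
    except OSError:
        # No PTR record, or the lookup failed — cannot verify.
        return False

# Example: an address from the TEST-NET-3 documentation range
# has no PTR record, so verification fails.
print(verify_google_crawler("203.0.113.5"))
```

The forward step matters: anyone can attach a fake `googlebot.com` PTR record to an IP they control, but only Google can make that hostname resolve back to the same address.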

IP list lookup technique

Some operators provide a public list of IP addresses used by their crawlers (Google publishes its crawler ranges as downloadable JSON files). A requester's IP can be cross-referenced against this list to verify a user-agent's authenticity. However, operators do not always keep these lists current, and website owners must re-fetch them regularly, so use this method with caution and in conjunction with other verification techniques.
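The cross-referencing step can be sketched with the standard-library `ipaddress` module. The range below is an illustrative subset only, not the authoritative list; in practice you would load the current ranges from the JSON file the operator publishes.

```python
import ipaddress

# Illustrative subset of published Googlebot IPv4 space — an assumption
# for this example; always load the operator's current list instead.
GOOGLEBOT_RANGES = [
    ipaddress.ip_network("66.249.64.0/19"),
]

def ip_in_ranges(ip: str, ranges) -> bool:
    """Return True if the IP falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)

print(ip_in_ranges("66.249.66.1", GOOGLEBOT_RANGES))  # True
print(ip_in_ranges("203.0.113.5", GOOGLEBOT_RANGES))  # False
```

A containment check against CIDR ranges is cheap enough to run on every request, which makes this a useful first-pass filter before the slower reverse-DNS verification.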