Published on 2025-08-07T06:18:08Z
Google-Extended
Google-Extended is not a web crawler itself, but a control mechanism that lets website owners decide whether their content can be used to train Google's generative AI models, such as those behind Gemini and the Vertex AI generative APIs. By adding a specific directive to a robots.txt file, publishers can opt out of having their content used for AI training without affecting their site's visibility or ranking in Google Search results. Google introduced it to give creators more control over their content in the age of AI.
What is Google-Extended?
Google-Extended is a control mechanism, not a separate web crawler, introduced by Google to give website owners control over the use of their content for training Google's AI models. It functions as a special token that can be used in a robots.txt file. It works in conjunction with Google's existing crawlers, like Googlebot. While Googlebot still crawls the site for search indexing, the Google-Extended token in robots.txt tells Google whether that crawled content can also be used to train and improve AI systems like Google Gemini and the Vertex AI APIs.
Why is Google-Extended relevant to my site's crawling?
Google-Extended does not crawl your site independently. Instead, it is a rule that Google's regular crawlers (like Googlebot) look for when they visit. If you have not explicitly blocked Google-Extended in your robots.txt file, Google considers your content eligible for use in training its AI models. This process is part of Google's standard crawling operations. The frequency of these crawls is determined by your site's normal crawl budget, which is influenced by factors like your content update schedule and site authority.
What is the purpose of Google-Extended?
The primary purpose of Google-Extended is to act as a consent and control mechanism, giving publishers a clear choice about whether to contribute their content to the development of Google's AI ecosystem. It specifically governs the use of content for training Google Gemini and Vertex AI generative APIs. Google introduced this in response to publisher concerns about the use of their content for AI training. Importantly, opting out via Google-Extended is designed to have no negative impact on a site's inclusion or ranking in standard Google Search results.
How do I use Google-Extended to block AI training?
To prevent your website's content from being used to train Google's generative AI models, add a rule for the Google-Extended token to your robots.txt file. This opts your site out of that use without affecting your Google Search ranking. The following lines block the entire site:
User-agent: Google-Extended
Disallow: /
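Before deploying the rule, it can help to sanity-check it locally. The sketch below uses Python's standard urllib.robotparser module to confirm that the rule blocks the Google-Extended token while leaving Googlebot unaffected. The helper name is_allowed and the example.com URL are illustrative assumptions, and Python's parser only approximates how Google interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rule from this article, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it relies on Python's own
    robots.txt parser, not on Google's implementation.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/post.html"  # placeholder URL
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Because the file contains no group for Googlebot or for the wildcard user-agent, the parser treats every other agent as allowed, which matches the intent of opting out of AI training only.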
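How do I verify that a user-agent is really operated by Google?
Because Google-Extended does not crawl on its own, the requests to verify are those claiming to come from Google's regular crawlers, such as Googlebot.
Reverse IP lookup technique
Run the host Linux command twice, starting with the IP address of the requester.
- > host IPAddressOfRequest
  This command prints the reverse lookup record (e.g., 4.4.8.8.in-addr.arpa.) and the hostname it points to.
- > host ReverseDNSFromTheOutputOfFirstRequest
  This command resolves that hostname back to an IP address. The request is authentic only if the result matches the requester's original IP address.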
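The two host lookups can also be scripted. The following is a minimal sketch of that forward-confirmed reverse DNS check using only Python's standard library. The function names and the GOOGLE_SUFFIXES list are assumptions made for illustration, so confirm the accepted domains against Google's current documentation before relying on them.

```python
import ipaddress
import socket
from typing import Callable, List, Tuple

# Assumption: hostname suffixes used by Google's crawlers. Verify this list
# against Google's documentation; it is not taken from this article.
GOOGLE_SUFFIXES: Tuple[str, ...] = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """First lookup: IP address -> hostname (PTR record)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Second lookup: hostname -> list of IP addresses."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_google_crawler(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
    suffixes: Tuple[str, ...] = GOOGLE_SUFFIXES,
) -> bool:
    """Return True only if both lookups agree and the hostname is Google's."""
    try:
        requester = ipaddress.ip_address(ip)
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False
        # The forward lookup must lead back to the requester's address.
        resolved = {ipaddress.ip_address(addr) for addr in forward(hostname)}
    except (OSError, ValueError):
        # DNS failure or malformed address: treat as unverified.
        return False
    return requester in resolved
```

The resolver functions are passed in as parameters so the check can be exercised with stand-in lookups during testing; in production the defaults perform real DNS queries, so results are worth caching per IP address.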