There are now over 30 AI crawlers actively scanning the web, and every major AI platform operates its own bot to collect the content that powers its answers. If your website blocks these crawlers, whether on purpose or by accident, your content becomes invisible to ChatGPT, Perplexity, Claude, Gemini, and every other AI engine that millions of people are turning to instead of Google. Understanding which bots exist, what they do, and how to manage access to them is no longer optional for any business that depends on being found online.
This is the most complete reference available for every known AI crawler operating in 2026. We cover all 31 bots, what they power, how to check whether your site is blocking them, and the exact robots.txt code you need to control access. At GetCited, we run crawler access checks as a core part of every site audit because this single technical detail can make or break your AI search visibility. Let's walk through all of it.
Why AI Crawlers Matter More Than You Think
Traditional web crawlers have been around since the early days of search. Googlebot, Bingbot, and their cousins visit your pages, index the content, and use it to determine your rankings in search results. You have been managing access to these bots for years through your robots.txt file, and the process is well understood.
AI crawlers work on a similar principle but serve a fundamentally different purpose. When GPTBot crawls your website, it is not building a search index in the traditional sense. It is collecting content that will be used to train large language models, power real-time AI search responses, or both. When someone asks ChatGPT a question about your industry, the quality and availability of your crawled content directly influences whether your brand shows up in that answer.
Here is the part that catches most people off guard: reputable AI crawlers respect robots.txt directives. That is good news because it means you have control. But it also means that a single misconfigured line in your robots.txt can silently cut you off from every major AI engine at once.
According to data from the GetCited ebook on AI search readiness, 18.9% of websites are currently blocking AI crawlers without knowing it. Nearly one in five sites have robots.txt rules that prevent AI bots from accessing their content, and the site owners have no idea. They wonder why their brand never appears in ChatGPT or Perplexity responses, not realizing the answer is sitting in a text file on their own server.
The Difference Between AI Crawlers and Traditional Search Crawlers
Before diving into the full list, it helps to understand the key distinction. Traditional crawlers like Googlebot index your pages and rank them in a list of results. The user sees your title tag, clicks through, and lands on your site. The value exchange is straightforward: you provide content, Google sends you traffic.
AI crawlers collect your content for a different pipeline. That content might be used to:
- Train foundation models. GPTBot, for instance, collects data that OpenAI uses to improve its models over time. This is the use case that generates the most debate around consent and copyright.
- Power real-time AI search. OAI-SearchBot, PerplexityBot, and Google-Extended collect content to generate live, cited answers. This is where the direct visibility opportunity lies.
- Feed AI features in existing platforms. Meta-ExternalAgent collects content to power Meta AI features across Facebook, Instagram, and WhatsApp. Bytespider feeds TikTok's AI capabilities.
Some crawlers serve a single purpose; others serve several. The critical thing to understand is that blocking a crawler removes you from everything that crawler feeds. Block GPTBot, for example, and your content disappears from the data OpenAI uses to train and improve its models; block OAI-SearchBot and ChatGPT-User as well, and ChatGPT loses the ability to cite or fetch your pages in conversations.
The Complete List of 31 AI Crawlers
Here is every known AI crawler operating as of early 2026, organized by the company that operates it. This table gives you the bot name (the exact string you need for your robots.txt file), the company behind it, and what it powers.
| Crawler Name | Company/Platform | What It Powers |
|---|---|---|
| GPTBot | OpenAI | ChatGPT model training and general content collection |
| OAI-SearchBot | OpenAI | ChatGPT Search real-time results and citations |
| ChatGPT-User | OpenAI | ChatGPT's live browsing feature when users ask it to visit a URL |
| Google-Extended | Google/Alphabet | Gemini AI, Google AI Overviews, and Vertex AI training |
| GoogleOther | Google/Alphabet | General-purpose research and development crawling |
| GoogleOther-Image | Google/Alphabet | Image-specific AI training and research |
| GoogleOther-Video | Google/Alphabet | Video content AI training and research |
| PerplexityBot | Perplexity AI | Perplexity search engine responses and citations |
| ClaudeBot | Anthropic | Claude model training and content collection |
| anthropic-ai | Anthropic | Anthropic's general AI research crawling |
| Bytespider | ByteDance | TikTok AI features, Doubao chatbot, and model training |
| Meta-ExternalAgent | Meta | Meta AI across Facebook, Instagram, WhatsApp, and Messenger |
| meta-externalfetcher | Meta | Meta AI content retrieval for real-time responses |
| Amazonbot | Amazon | Alexa AI answers, Amazon search, and AWS AI services |
| AppleBot-Extended | Apple | Apple Intelligence, Siri AI features, and Safari AI tools |
| Cohere-Crawl | Cohere | Cohere language models and enterprise AI products |
| cohere-ai | Cohere | Cohere's general AI research crawling |
| Diffbot | Diffbot | Knowledge graph construction and AI data extraction |
| Timesbot | Perplexity AI | Secondary Perplexity crawler for news and time-sensitive content |
| Kangaroo Bot | Kangaroo LLM | Kangaroo language model training |
| CCBot | Common Crawl | Open-source training data used by many AI companies |
| DataForSeoBot | DataForSEO | SEO analytics and AI-powered search data tools |
| Scrapy | Various | Open-source framework used by multiple AI data collectors |
| PetalBot | Huawei | Huawei's AI search and Petal Search engine |
| Ai2Bot | Allen Institute for AI | AI2 research models and open-source AI training |
| Omgili | Webz.io | AI-powered content intelligence and data feeds |
| Youbot | You.com | You.com AI search engine responses |
| iaskspider | iAsk.AI | iAsk AI search engine answers |
| ISSCyberRiskCrawler | ISS | AI-powered cyber risk analysis and scanning |
| Seekr | Seekr | AI content evaluation and trust scoring |
| VelenpublicWebCrawler | Velen | AI data collection and model training |
That is 31 crawlers, and the number continues to grow. Six months ago, this list had about 20 entries. A year from now, it will likely have 40 or more as new AI companies launch and existing ones spin up specialized bots for different purposes.
The Critical Crawlers You Need to Prioritize
Not all 31 crawlers carry equal weight. If you are going to focus your attention anywhere, these are the ones that matter most for AI search visibility in 2026.
GPTBot and OAI-SearchBot (OpenAI/ChatGPT)
GPTBot is the big one. With over 800 million weekly active users on ChatGPT, OpenAI's crawlers represent the single largest AI audience your content can reach. GPTBot handles general content collection for model training and improvement. OAI-SearchBot is the newer, more targeted crawler that specifically powers ChatGPT Search, the feature that generates real-time, cited answers when users ask questions.
If you block GPTBot, you lose access to ChatGPT's knowledge base. If you block OAI-SearchBot, you lose access to ChatGPT Search citations specifically. Many sites block one without realizing they also need to allow the other. You want both open.
There is also ChatGPT-User, which is the user-agent string that appears when a ChatGPT user asks the model to browse to a specific URL during a conversation. Blocking this prevents ChatGPT from accessing your content even when a user directly asks it to look at your page.
PerplexityBot (Perplexity AI)
Perplexity processes over 780 million queries per month, and every response includes numbered citations with clickable links back to source content. This makes PerplexityBot one of the highest-value AI crawlers in terms of direct traffic potential. When Perplexity cites your page, users can click through to your site. It is one of the few AI engines where the citation model actually sends real referral traffic.
Blocking PerplexityBot cuts you out of a search engine used by millions of people who are actively researching products, services, and topics with high purchase intent.
ClaudeBot (Anthropic/Claude)
ClaudeBot is Anthropic's crawler, and it feeds content into Claude, one of the most widely used AI assistants in enterprise and professional contexts. Claude is particularly popular among business users, developers, and researchers. If your audience skews professional or B2B, Claude visibility matters.
Anthropic also operates a secondary crawler under the user-agent string "anthropic-ai" for general research purposes.
Google-Extended (Google/Gemini)
Google-Extended controls whether your content is used for Gemini (Google's AI chatbot), AI Overviews in Google Search, and Vertex AI (Google's enterprise AI platform). This is separate from Googlebot, which handles traditional search indexing. You can allow Googlebot while blocking Google-Extended, which means your site appears in regular Google results but not in AI Overviews.
Given that AI Overviews now appear in roughly 60% of Google searches, blocking Google-Extended is a significant decision. You would be opting out of the feature that dominates the top of Google's results page for the majority of queries.
Bytespider (ByteDance/TikTok)
Bytespider is ByteDance's crawler, and it is one of the most aggressive in terms of crawl volume. It powers AI features across TikTok, the Doubao chatbot (popular in Asia), and ByteDance's various AI initiatives. Bytespider has a reputation for heavy crawling, and some site operators block it for performance reasons rather than strategic ones. If server load is a concern, you can manage its crawl rate with a Crawl-delay directive rather than blocking it entirely.
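As a sketch, a rate-limiting rule for Bytespider might look like the following. Note that Crawl-delay is a non-standard directive: crawlers that support it interpret the value as a minimum number of seconds between requests, but not every crawler honors it.

```
User-agent: Bytespider
Crawl-delay: 10
```

If a crawler ignores the directive entirely, rate limiting at the CDN or server level is the fallback.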
Meta-ExternalAgent (Meta)
Meta's AI features span Facebook, Instagram, WhatsApp, and Messenger, reaching billions of users. Meta-ExternalAgent is the crawler that feeds content into Meta AI. Blocking it means your content will not surface when people use AI features across Meta's entire family of apps. Given the sheer size of Meta's user base, this is a crawler worth keeping access open for.
Amazonbot
Amazonbot has been around for years, originally feeding Alexa's question-answering capabilities. It now also powers AI features across Amazon's ecosystem, including product-related AI search and AWS-powered AI services. For e-commerce businesses especially, Amazonbot access is worth maintaining.
AppleBot-Extended
Apple Intelligence launched in late 2024 and continues to expand across iPhone, iPad, and Mac. AppleBot-Extended is the crawler that feeds content into Apple's AI features, including enhanced Siri capabilities and Safari's AI tools. With over a billion active Apple devices worldwide, the potential reach here is enormous.
How to Check if Your Site Is Blocking AI Crawlers
This takes about 60 seconds. Here is the process.
Step 1: Open your robots.txt file.
Type your domain followed by /robots.txt in any browser. For example:
https://yourdomain.com/robots.txt
This file is publicly accessible on virtually every website. It is a plain text file that tells crawlers what they can and cannot access.
Step 2: Look for AI crawler directives.
Scan the file for any of the crawler names listed above. You are looking for entries like this:
User-agent: GPTBot
Disallow: /
That two-line block tells GPTBot it cannot access any page on your site. The forward slash after "Disallow:" means "everything." If you see this pattern for any of the AI crawlers listed above, that crawler is completely blocked from your site.
Step 3: Check for wildcard blocks.
Some robots.txt files use a wildcard to block all crawlers:
User-agent: *
Disallow: /
This blocks every crawler, including all AI bots. If your site has this, nothing can access your content. This is occasionally used intentionally on staging sites or internal tools, but if it is on your production site, it is almost certainly a mistake.
Step 4: Look for partial blocks.
Some sites block AI crawlers from specific sections:
User-agent: GPTBot
Disallow: /private/
Disallow: /internal/
Allow: /
This is actually a reasonable configuration. It blocks GPTBot from specific directories while allowing access to everything else. Partial blocks are fine when intentional.
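If you want to check these rules programmatically rather than by eye, Python's standard-library robots.txt parser can evaluate them. A minimal sketch, using an inline robots.txt (with a hypothetical domain) for illustration:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: GPTBot is fully blocked, all other bots allowed.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) answers: may this crawler fetch this URL?
print(parser.can_fetch("GPTBot", "https://yourdomain.com/blog/"))        # False
print(parser.can_fetch("PerplexityBot", "https://yourdomain.com/blog/")) # True
```

To test a live site, swap the inline string for `parser.set_url("https://yourdomain.com/robots.txt")` followed by `parser.read()`.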
The robots.txt Code to Allow All Major AI Crawlers
If you want to make sure every major AI crawler can access your site, here is the complete robots.txt configuration you need. You can add these directives to your existing robots.txt file.
# OpenAI Crawlers
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
# Google AI Crawlers
User-agent: Google-Extended
Allow: /
User-agent: GoogleOther
Allow: /
User-agent: GoogleOther-Image
Allow: /
User-agent: GoogleOther-Video
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Anthropic/Claude
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
# ByteDance/TikTok
User-agent: Bytespider
Allow: /
# Meta
User-agent: Meta-ExternalAgent
Allow: /
User-agent: meta-externalfetcher
Allow: /
# Amazon
User-agent: Amazonbot
Allow: /
# Apple
User-agent: AppleBot-Extended
Allow: /
# Cohere
User-agent: Cohere-Crawl
Allow: /
User-agent: cohere-ai
Allow: /
# Other Notable AI Crawlers
User-agent: CCBot
Allow: /
User-agent: Diffbot
Allow: /
User-agent: PetalBot
Allow: /
User-agent: Ai2Bot
Allow: /
User-agent: Youbot
Allow: /
A few important notes about this configuration:
You do not need to explicitly allow crawlers that are not blocked. If your robots.txt does not mention a specific crawler, that crawler is allowed by default. The code above is useful if you have a wildcard block (User-agent: * with Disallow: /) and want to create exceptions for AI crawlers. It is also useful as documentation so your team knows exactly which bots have been considered.
If you want to block specific crawlers while allowing others, flip the directive. Replace Allow: / with Disallow: / for any crawler you want to keep out. Some site owners choose to block training-focused crawlers (like GPTBot and CCBot) while allowing search-focused crawlers (like OAI-SearchBot and PerplexityBot). This lets your content appear in AI search results without being used for model training. Whether this distinction holds up technically is debatable, but it is a valid strategic choice.
In robots.txt, specificity matters more than order. Under the Robots Exclusion Protocol, a crawler follows the group whose User-agent line most specifically matches it, wherever that group appears in the file. A named group such as "User-agent: GPTBot" therefore overrides a wildcard ("User-agent: *") group for GPTBot, even if the wildcard comes first. That said, keeping your AI crawler directives grouped together after any wildcard rules makes the file much easier to audit.
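For example, if an old wildcard block is in place, a named group carves out an exception for a specific crawler. A sketch:

```
# Wildcard rule: block every crawler by default
User-agent: *
Disallow: /

# Named group: GPTBot matches this more specific group,
# so it gets full access despite the wildcard block
User-agent: GPTBot
Allow: /
```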
Selective Blocking: A Strategic Approach
Not everyone wants to open the doors to all 31 crawlers. There are legitimate reasons to block some while allowing others, and the decision should be strategic rather than reflexive.
The Case for Allowing All AI Crawlers
The strongest argument for full access is simple visibility. Every crawler you block is an audience you forfeit. With AI search usage growing at double-digit percentages quarter over quarter, the opportunity cost of blocking crawlers increases every month.
If your business depends on being discovered, cited, and referenced, whether you sell services, publish content, or operate an e-commerce store, maximum crawler access gives you maximum reach. The content is already public on your website. Blocking AI crawlers does not make it private. It just makes it invisible to the fastest-growing discovery channels on the internet.
The Case for Selective Blocking
There are scenarios where blocking specific crawlers makes sense:
Server performance. Some AI crawlers, Bytespider in particular, crawl aggressively and can strain server resources. If you run a smaller site with limited hosting capacity, you may need to block or rate-limit heavy crawlers to keep your site running smoothly for human visitors.
Content licensing concerns. If you produce premium content that is behind a paywall or sold through subscriptions, allowing training-focused crawlers to ingest that content for free raises legitimate business concerns. Blocking GPTBot and CCBot while allowing search-focused crawlers like OAI-SearchBot and PerplexityBot is one way to balance visibility with content protection.
Competitive intelligence. In some industries, the content on your site represents proprietary knowledge that you do not want feeding a competitor's AI tools. This is a niche concern but a real one in sectors like finance, legal, and specialized consulting.
A Practical Middle Ground
For most businesses, the smartest approach is to allow all search-focused AI crawlers (the ones that generate citations and drive traffic) while making a deliberate decision about training-focused crawlers. Here is what that looks like:
Always allow (search and citation focused):

- OAI-SearchBot (ChatGPT Search)
- PerplexityBot (Perplexity Search)
- Google-Extended (AI Overviews)
- ChatGPT-User (ChatGPT browsing)

Decide based on your situation (training and general purpose):

- GPTBot (OpenAI model training)
- ClaudeBot (Anthropic model training)
- CCBot (Common Crawl, used by many AI companies)
- Bytespider (ByteDance, high crawl volume)

Low risk to allow (smaller reach but minimal downside):

- Amazonbot, AppleBot-Extended, Meta-ExternalAgent, Cohere-Crawl, and the remaining crawlers
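In robots.txt form, that middle ground might look like the sketch below. Here the training-focused crawlers are blocked purely as an illustration; adjust each group to match your own decision.

```
# Search and citation crawlers: allow
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ChatGPT-User
Allow: /

# Training-focused crawlers: blocked in this example
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```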
How AI Crawlers Actually Behave
Understanding crawler behavior helps you manage them more effectively.
Crawl Frequency
AI crawlers do not all crawl at the same rate. Google-Extended piggybacks on Google's existing crawl infrastructure, so it visits frequently and efficiently. OAI-SearchBot tends to crawl in response to real-time queries, meaning it visits when someone asks ChatGPT a question that triggers a web search. PerplexityBot behaves similarly, crawling on demand as users submit queries.
Training-focused crawlers like GPTBot and CCBot tend to do periodic deep crawls, visiting large portions of your site in bursts rather than continuously.
What They Actually Collect
AI crawlers primarily collect text content. They parse your HTML, extract the readable text, and generally ignore JavaScript-rendered content unless they run a headless browser (most do not). This means that if your content is behind client-side JavaScript rendering without server-side fallbacks, many AI crawlers cannot see it at all.
They also collect:

- Page titles and heading structure
- Meta descriptions
- Structured data (JSON-LD, schema markup)
- Image alt text
- Internal and external link structures
- Publication and modification dates
Verification
You can verify that a crawler is legitimate by checking its IP address against the operating company's published ranges. OpenAI, Google, and other major operators publish the IP ranges their crawlers use. If a bot claims to be GPTBot but its IP does not match OpenAI's published range, it is a fake. This matters for security-conscious sites that want to allow specific crawlers without opening themselves up to scrapers impersonating those crawlers.
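As a sketch, checking a visitor's IP against a published range list is straightforward with Python's ipaddress module. The CIDR range below is a made-up placeholder (a reserved documentation block), not a real GPTBot range; substitute the list each operator publishes.

```python
import ipaddress

# Placeholder CIDR ranges -- NOT real GPTBot ranges. Fetch the current
# list from the operator's documentation (OpenAI publishes theirs online).
GPTBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]  # TEST-NET-1, illustrative

def is_verified(ip_str: str, ranges) -> bool:
    """True if the claimed crawler's source IP falls inside a published range."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in ranges)

print(is_verified("192.0.2.44", GPTBOT_RANGES))    # True: inside the range
print(is_verified("198.51.100.7", GPTBOT_RANGES))  # False: likely an impersonator
```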
Common Mistakes That Block AI Crawlers
Based on the audits we see at GetCited, these are the most frequent mistakes that accidentally block AI crawlers.
The Overly Aggressive Wildcard Block
The most common culprit is a robots.txt file that was written years ago with a broad wildcard rule:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
This does not block AI crawlers; it only keeps all bots out of /wp-admin/ and /wp-includes/. But sometimes developers add more aggressive rules:
User-agent: *
Disallow: /
This blocks everything. It was probably added during development and never removed after launch. It happens more often than you would think.
CMS or Plugin Defaults
Some CMS platforms and security plugins add AI crawler blocks by default or as part of a "security hardening" package. WordPress security plugins are frequent offenders. They add blocks for crawlers they classify as "bad bots," and some of those lists include AI crawlers. Check your security plugin settings if you use one.
CDN or WAF Interference
Cloudflare, Sucuri, and other CDN/WAF providers sometimes block AI crawlers at the network level before they even reach your robots.txt file. This is separate from robots.txt and requires checking your CDN dashboard. Look for bot management rules that might be blocking or challenging AI crawler user-agent strings.
Outdated Block Lists
Some sites added AI crawler blocks in 2023 or 2024 when there was significant concern about AI companies using content without permission. The landscape has changed. AI search now represents real traffic and visibility opportunities. If you blocked crawlers two years ago, it is worth revisiting that decision with fresh eyes.
Monitoring AI Crawler Activity on Your Site
Knowing which AI crawlers visit your site and what they access gives you valuable intelligence.
Server Log Analysis
Your server access logs record every crawler visit, including the user-agent string, the pages accessed, the response codes, and the timestamps. Filter your logs for the AI crawler user-agent strings listed in this article to see exactly which bots are visiting, how often, and what they are looking at.
Most hosting providers give you access to raw logs through cPanel, Plesk, or a similar control panel. You can also use log analysis tools like GoAccess or Matomo to parse the data more easily.
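A minimal sketch of that log filtering in Python, using a few made-up access-log lines for illustration. The same substring matching works on a real log file read line by line.

```python
from collections import Counter

# User-agent substrings to look for (a subset of the crawlers in this article)
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot",
               "Google-Extended", "Bytespider", "CCBot"]

def count_ai_crawler_hits(log_lines):
    """Count visits per AI crawler by matching user-agent substrings."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                counts[bot] += 1
    return counts

# Made-up sample lines in combined log format
sample = [
    '1.2.3.4 - - [10/Jan/2026] "GET /blog HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.1"',
    '5.6.7.8 - - [10/Jan/2026] "GET / HTTP/1.1" 200 742 "-" "PerplexityBot/1.0"',
    '9.9.9.9 - - [10/Jan/2026] "GET / HTTP/1.1" 200 742 "-" "Mozilla/5.0"',
]
print(count_ai_crawler_hits(sample))  # per-bot hit counts
```

On a real server, replace `sample` with the lines of your access log, e.g. `open("/var/log/nginx/access.log")`, using whatever path your host uses.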
Google Search Console
Google Search Console shows crawl stats for Google's bots, including Google-Extended. Check the crawl stats report to see how frequently Google's AI crawler visits your site and which pages it prioritizes.
Third-Party Monitoring
Several tools now offer AI crawler monitoring as a feature. GetCited checks crawler access as part of its audit process, identifying which AI bots can and cannot reach your content and flagging misconfigurations that hurt your AI visibility. This is particularly useful because it tests actual access rather than just reading your robots.txt. Sometimes a robots.txt file looks correct, but a CDN rule or server configuration blocks the crawler anyway.
The Relationship Between AI Crawlers and AI Citations
Allowing AI crawlers to access your site is necessary but not sufficient for appearing in AI responses. Think of it as the first gate. If the gate is closed, nothing else matters. But if the gate is open, the quality and structure of your content determines whether AI engines actually cite you.
Once a crawler can access your pages, the AI engine evaluates your content based on factors like:
- Authority and trustworthiness. Does your site have a track record of accurate, well-sourced content?
- Relevance and specificity. Does your content directly answer the question being asked?
- Recency. Was the content published or updated recently?
- Information density. Does the content contain specific facts, statistics, and data points?
- Structure and clarity. Is the content organized with clear headings and direct answers?
Crawler access gets your content into the system. Content quality determines whether it gets cited in responses. Both pieces matter, and optimizing for both is what separates brands that appear in AI answers from those that do not.
What Happens When You Unblock AI Crawlers
If you discover that your site has been blocking AI crawlers and you remove those blocks, the effects are not instantaneous. Here is what to expect:
Within 24 to 48 hours: Most AI crawlers will detect the change in your robots.txt and begin accessing your content. OpenAI's crawlers tend to check robots.txt frequently.
Within 1 to 2 weeks: Your content starts appearing in the crawlers' indexes. Search-focused crawlers like OAI-SearchBot and PerplexityBot may begin referencing your content in responses.
Within 1 to 3 months: You should see a meaningful difference in AI search visibility, assuming your content is high quality and relevant to queries that AI engines receive.
The timeline varies based on your site's authority, the volume of content you publish, and how frequently your content gets updated. Sites that publish regularly and update existing pages tend to get indexed faster.
The Bigger Picture: AI Crawlers and the Future of Web Visibility
The number of AI crawlers will keep growing. Every new AI product needs training data and real-time content access. Every new AI search engine needs a way to find and retrieve information from across the web. The 31 crawlers on this list today will be 50 or more within a year.
For website owners and marketers, this means that crawler management is becoming a permanent part of your technical stack. It is not a one-time configuration. It is an ongoing process of monitoring new crawlers, evaluating their value, and making strategic decisions about access.
The companies that adapt fastest will have a structural advantage. While competitors fumble with outdated robots.txt files or blanket blocks they set and forgot, the businesses that actively manage crawler access and optimize their content for AI engines will capture a disproportionate share of AI-driven visibility.
The 18.9% of sites that are currently blocking AI crawlers without knowing it represent a real opportunity for everyone else. Every blocked competitor is a competitor who cannot appear in AI answers. Every day they remain blocked is another day your content can fill the gap.
Check your robots.txt. Fix what needs fixing. And start treating AI crawlers as what they are: the gatekeepers of the next generation of search.
Frequently Asked Questions
What is an AI crawler?
An AI crawler is an automated bot operated by an AI company that visits websites, reads their content, and collects that information for use in AI products. This includes training language models (like ChatGPT and Claude), powering AI search engines (like Perplexity), and generating AI-assisted search results (like Google AI Overviews). AI crawlers identify themselves through unique user-agent strings in your server logs and respect the access rules you set in your robots.txt file.
How do I know if my website is blocking AI crawlers?
Go to yourdomain.com/robots.txt in any browser and look for entries that mention AI crawler names like GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, or Google-Extended. If you see "Disallow: /" under any of these user-agent names, that crawler is blocked from your entire site. Also check for a wildcard block (User-agent: * with Disallow: /) which would block all crawlers including AI bots. If you want a more thorough check, services like GetCited test actual crawler access rather than just reading the robots.txt file, which catches issues caused by CDN or firewall rules that robots.txt alone would not reveal.
Should I block or allow AI crawlers?
For most businesses, allowing AI crawlers is the better strategic choice. Blocking them removes your content from AI-powered search engines and chatbots used by hundreds of millions of people. The main reasons to block specific crawlers are server performance concerns (some crawlers are aggressive), content licensing issues (if you sell premium content), or specific competitive concerns. A practical middle ground is to allow all search-focused crawlers (OAI-SearchBot, PerplexityBot, Google-Extended) while making a case-by-case decision about training-focused crawlers (GPTBot, CCBot, ClaudeBot).
What is the difference between GPTBot and OAI-SearchBot?
GPTBot is OpenAI's general-purpose crawler used for collecting content that improves ChatGPT's underlying models over time. OAI-SearchBot is a newer, separate crawler used specifically to power ChatGPT Search, the feature where ChatGPT searches the web in real time and provides cited answers. You can block one while allowing the other. Blocking GPTBot prevents your content from being used in model training. Blocking OAI-SearchBot prevents your content from appearing in ChatGPT Search results. Many sites that want maximum visibility allow both, while sites concerned about training data usage block GPTBot but allow OAI-SearchBot.
Does allowing AI crawlers guarantee my content will appear in AI responses?
No. Allowing crawler access is the first step, but it does not guarantee citations or mentions in AI responses. Think of it as opening the door. The AI still needs a reason to walk through it. Once crawlers can access your content, the AI engine evaluates it for authority, relevance, recency, information density, and structural clarity. Content that provides specific data points, direct answers to common questions, and well-organized information is far more likely to be cited than content that is vague or generic. Crawler access is necessary but not sufficient. You need both technical access and content quality to consistently appear in AI-generated answers.