The Complete List of AI Bots Crawling Websites in 2026

A reference list of 60+ AI bot user agents active in 2026 — who operates them, what they're used for, how they behave, and whether they respect robots.txt.

As of 2026, there are over 60 distinct AI bot user agents actively crawling the web. Some belong to household names like OpenAI, Google, and Anthropic. Others are operated by less well-known companies building specialized models, AI-powered search engines, or data extraction pipelines.

This list covers every major AI crawler we’ve identified, organized by category. For each bot, we include the operator, primary purpose, known robots.txt compliance status, and behavioral notes based on real-world crawl data from AI Bot Tracker installations.

If you’re not sure which bots are hitting your site right now, install AI Bot Tracker (free) to get a real-time dashboard of all AI crawler activity.

How AI Bots Identify Themselves

Every web crawler sends a user-agent string: a text identifier carried in the User-Agent header of each HTTP request. Legitimate AI bots use distinctive user-agent strings that identify the operator and purpose. For example, OpenAI’s crawlers include “GPTBot” in their user-agent, and Anthropic’s include “ClaudeBot.”

This is how detection tools like AI Bot Tracker identify which AI company is crawling your site. The user-agent string is the first line of identification, though as we’ll cover later, not all bots identify themselves honestly.

Here’s what a typical AI bot user-agent looks like in your server logs:

Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)

The compatible; GPTBot/1.2 portion is what distinguishes this from a normal browser visit. The URL at the end points to documentation about the bot.
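As a rough illustration of how a detection tool might pull the bot name out of such a string, the sketch below applies a regular expression to the "compatible; Name/version" token. The pattern and the function name extract_bot are assumptions made for this example, not a description of how AI Bot Tracker parses user-agents, and real bots vary in how they format this token.

```python
import re

# Illustrative pattern (an assumption, not an official grammar):
# matches a "compatible; Name/version" token such as "compatible; GPTBot/1.2".
BOT_TOKEN = re.compile(r"compatible;\s*([A-Za-z0-9_.-]+)/([0-9.]+)")

def extract_bot(user_agent: str):
    """Return (name, version) if the string carries a bot token, else None."""
    match = BOT_TOKEN.search(user_agent)
    if match is None:
        return None
    return match.group(1), match.group(2)

ua = "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
print(extract_bot(ua))  # ('GPTBot', '1.2')
```

A browser string without a "compatible;" token returns None, so this check only recognizes bots that announce themselves in this particular format.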

LLM Training Crawlers

These bots collect web content to build training datasets for large language models. They tend to crawl broadly and deeply — every public page on your site is a potential training sample.

Bot | Operator | Purpose | Respects robots.txt
GPTBot | OpenAI | Training data for GPT models | Yes
ChatGPT-User | OpenAI | Real-time web browsing for ChatGPT | Yes
OAI-SearchBot | OpenAI | SearchGPT web index | Yes
ClaudeBot | Anthropic | Training data for Claude models | Yes
Google-Extended | Google | Training data for Gemini models | Yes
Bytespider | ByteDance | Training data for ByteDance AI | Partial
CCBot | Common Crawl | Open-source training corpus | Yes
Meta-ExternalAgent | Meta | Training data for Llama models | Yes
Amazonbot | Amazon | Training data for Alexa and Amazon AI | Yes
FacebookBot | Meta | Content preview and model training | Partial
AI2Bot | Allen Institute for AI | Open research datasets | Yes
Diffbot | Diffbot | Structured web data extraction | Yes

GPTBot (OpenAI)

GPTBot is OpenAI’s primary training data crawler. It’s one of the most frequently seen AI bots across the web. OpenAI publishes a list of IP ranges for GPTBot, which makes it verifiable — you can confirm that a request claiming to be GPTBot actually originates from OpenAI’s infrastructure.
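A minimal sketch of that verification, assuming you have already obtained the operator's published CIDR list. The ranges in the code are reserved documentation addresses used purely as placeholders, not OpenAI's real ranges, and is_from_published_ranges is a hypothetical helper name.

```python
import ipaddress

# PLACEHOLDER ranges from the reserved documentation blocks.
# These are NOT OpenAI's real ranges; load the operator's published list instead.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

def is_from_published_ranges(remote_ip: str, cidrs=PUBLISHED_RANGES) -> bool:
    """Return True if remote_ip falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(remote_ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr)
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so a mixed list of ranges is safe to iterate over.
        if address in network:
            return True
    return False

print(is_from_published_ranges("192.0.2.15"))    # True
print(is_from_published_ranges("198.51.100.7"))  # False
```

A request whose user-agent says GPTBot but whose source address fails this check is a candidate for the disguised-crawler handling described later in this article.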

GPTBot respects robots.txt Disallow rules. Blocking it prevents your content from being used in future GPT model training, but does not retroactively remove content already collected. OpenAI also operates ChatGPT-User, a separate bot that fetches pages in real time when a ChatGPT user asks it to browse the web, and OAI-SearchBot for their SearchGPT product.

To block all three in robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

ClaudeBot (Anthropic)

Anthropic’s training data crawler for the Claude model family. ClaudeBot is generally well-behaved — it respects robots.txt, crawls at moderate rates, and identifies itself clearly. Anthropic also operates anthropic-ai, a separate user-agent used when Claude accesses the web as an AI agent (distinct from training data collection).

Google-Extended

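Google-Extended is Google’s opt-out mechanism for Gemini model training. It is a robots.txt control token rather than a separate crawler: Google fetches pages with its existing user agents, and the Google-Extended rule decides whether that content may be used for Google’s AI products. Unlike Googlebot (which you probably want to allow for search indexing), Google-Extended has no effect on search indexing, and it will not show up as a user-agent in your server logs. You can block Google-Extended while keeping Googlebot active: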

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

This is an important distinction — blocking Googlebot affects your search rankings, but blocking Google-Extended only affects AI training.
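One way to sanity-check rules like these before deploying them is Python's standard urllib.robotparser module. The sketch below parses the two groups shown above and asks which agents may fetch a page; the example.com URL is a placeholder. The standard-library parser matches agent names by substring, which can be looser than a crawler's own matching, so treat the result as a rough check rather than a guarantee of how Google interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt snippet above.
RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://example.com/some-post"  # placeholder URL
for agent in ("Googlebot", "Google-Extended"):
    print(agent, parser.can_fetch(agent, url))
```

With these rules, Googlebot is reported as allowed and Google-Extended as blocked.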

Bytespider (ByteDance)

Bytespider is ByteDance’s crawler, used to collect training data for their AI products. It has a reputation for aggressive crawl behavior — high request volumes, rapid-fire page fetches, and documented cases of ignoring robots.txt Disallow rules on some sites.

Bytespider consistently appears in AI Bot Tracker’s “most active crawlers” data across WordPress installations. If you see a single bot consuming disproportionate bandwidth on your site, Bytespider is often the culprit.
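To put a number on "rapid-fire", a sliding-window counter is one simple option. The sketch below flags any client key (here the user-agent token, though an IP address would work the same way) that exceeds a request budget within a time window. The class name RateWatcher and the thresholds are arbitrary assumptions for illustration, not values recommended by any bot operator or by AI Bot Tracker.

```python
from collections import defaultdict, deque

class RateWatcher:
    """Flag client keys that exceed max_requests within window_seconds."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # key -> recent request timestamps

    def record(self, key: str, timestamp: float) -> bool:
        """Record one request; return True if the key is over budget."""
        recent = self.history[key]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests

watcher = RateWatcher(max_requests=3, window_seconds=1.0)
flags = [watcher.record("Bytespider", t) for t in (0.0, 0.1, 0.2, 0.3)]
print(flags)  # [False, False, False, True]
```

Timestamps are passed in explicitly so the same class can replay old log entries as well as watch live traffic.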

CCBot (Common Crawl)

CCBot powers Common Crawl, a nonprofit that maintains an open repository of web crawl data used by researchers and AI companies worldwide. Many AI training datasets are built on top of Common Crawl data, which means blocking CCBot has a downstream effect on multiple AI models — not just one company’s products.

CCBot crawls respectfully and respects robots.txt. However, its crawl data is publicly available, which means any company can use it for training without directly crawling your site.

Meta-ExternalAgent and FacebookBot

Meta operates two crawlers relevant to AI. Meta-ExternalAgent collects training data for Llama models and Meta’s AI products. FacebookBot handles content previews for link sharing on Facebook and Instagram but has also been associated with AI data collection. Their robots.txt compliance differs in practice: Meta-ExternalAgent generally honors Disallow rules, while FacebookBot’s compliance is only partial.

AI Search Engine Crawlers

These bots power AI-enhanced search products. Unlike training crawlers that collect data in bulk for model building, search crawlers fetch specific pages to generate real-time answers for user queries.

Bot | Operator | Purpose | Respects robots.txt
PerplexityBot | Perplexity AI | AI search and answer generation | Yes
YouBot | You.com | AI search engine index | Yes
Applebot-Extended | Apple | Apple Intelligence features | Yes
cohere-ai | Cohere | Enterprise AI search | Yes
DuckAssistBot | DuckDuckGo | AI-assisted search answers | Yes
PetalBot | Huawei | Petal Search AI features | Yes

PerplexityBot

Perplexity AI’s crawler fetches pages to generate cited answers in their AI search engine. Perplexity attributes sources in its responses, which creates a value exchange — your content gets cited and linked. This makes PerplexityBot one of the AI crawlers that many site owners choose to allow.

However, Perplexity has faced criticism for sometimes providing detailed summaries that reduce the need for users to click through to the source. Whether to allow PerplexityBot depends on whether you value the citation and referral traffic, or whether you see AI-generated summaries as substitutes for your original content.

Applebot-Extended

Applebot-Extended is Apple’s opt-out mechanism for Apple Intelligence features, separate from the standard Applebot used for Siri and Spotlight search. Like Google-Extended, it lets you block AI training while keeping Apple’s search features intact.

AI Agent and Automation Crawlers

These bots operate on behalf of AI agent frameworks — systems where an AI model browses the web to complete tasks for users, rather than collecting training data or building a search index.

Bot | Operator | Purpose | Respects robots.txt
anthropic-ai | Anthropic | Claude agent web access | Yes
iaskspider | iAsk.Ai | AI Q&A platform | Yes
Webz.io | Webz.io | Web data-as-a-service for AI | Partial
Scrapy | Various | Open-source scraping framework | Configurable

Agent crawlers are a growing category. As AI assistants become more capable of browsing the web autonomously, the volume of agent-driven web requests is increasing. These visits look different from training crawls — they tend to be targeted (specific pages) rather than broad (entire sites), and they happen in real time in response to user prompts.

SEO and Analytics Bots With AI Features

Traditional SEO and analytics tools have added AI-powered features that require their crawlers to collect additional data. These bots were crawling the web before the AI era, but their scope has expanded.

Bot | Operator | Purpose | Respects robots.txt
SemrushBot | Semrush | SEO analytics and AI features | Yes
AhrefsBot | Ahrefs | Backlink analysis and AI tools | Yes
DataForSeoBot | DataForSEO | SEO data collection | Yes
BLEXBot | BLEXBot | Web analytics platform | Yes
DotBot | Moz | SEO analytics and domain authority | Yes
MJ12bot | Majestic | Link intelligence and analytics | Yes

These bots are generally well-behaved and respect robots.txt. Most site owners allow them because SEO tool access is mutually beneficial — you use these tools to analyze your own site, and they need to crawl your site to provide that data.

Content and Data Crawlers

Specialized crawlers that collect content for various AI applications — image recognition, sentiment analysis, content aggregation, and niche AI products.

Bot | Operator | Purpose | Respects robots.txt
ImagesiftBot | Imagesift | Image data for AI models | Partial
Kangaroo Bot | Kangaroo | AI content analysis | Unknown
Timpibot | Timpi | Decentralized search and AI | Yes
VelenPublicWebCrawler | Velen | Public web data collection | Yes
Omgili | Omgili | Discussion and forum content for AI | Yes
Seekport | Seekport | European search and AI index | Yes
SentiBot | SentiOne | Social listening and sentiment AI | Yes
Barkrowler | Babbar | Web link graph analysis | Yes
TurnitinBot | Turnitin | AI plagiarism detection | Yes

These crawlers vary widely in crawl volume. Some visit infrequently, while others (particularly image-focused bots) can generate significant traffic on media-heavy sites.

Aggressive and Non-Compliant Crawlers

Not all AI bots play by the rules. Some ignore robots.txt, disguise their identity, or crawl at rates that strain server resources.

Bot | Behavior
Bytespider | Documented ignoring Disallow on some sites; high crawl volume
Various unnamed | Use generic browser user-agents to evade detection
Residential proxy bots | Rotate through residential IPs to avoid IP-based blocks
Headless browser scrapers | Execute JavaScript and mimic real browsers

Disguised Crawlers

An increasing number of AI data collection operations use standard browser user-agent strings — Chrome, Firefox, Safari — to avoid detection. These crawlers are invisible to user-agent-based blocking because they look identical to a human visitor in your server logs.

Disguised crawlers are a significant problem because traditional blocking methods don’t work against them. You can’t block them with robots.txt (they don’t identify themselves as bots), and you can’t block them by user-agent (they use the same strings as real browsers).

The most effective defense against disguised crawlers is behavioral detection. Honeypot traps catch bots that follow hidden links no human would click. AI Bot Tracker’s honeypot feature embeds invisible links in your pages — when a crawler follows them, it’s flagged and can be automatically blocked, tarpitted, or shadowbanned.
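The honeypot idea can be sketched in a few lines. This is not AI Bot Tracker's implementation: the trap path, the class name HoneypotMonitor, and the choice to key on client IP are all assumptions made for illustration.

```python
# Hypothetical trap path: linked invisibly from your pages, so a human
# visitor should never request it.
TRAP_PATH = "/internal-archive-do-not-follow/"

class HoneypotMonitor:
    """Remember which client IPs have requested the trap path."""

    def __init__(self, trap_path: str = TRAP_PATH):
        self.trap_path = trap_path
        self.flagged = set()

    def observe(self, client_ip: str, path: str) -> bool:
        """Record one request; return True if the client is (now) flagged."""
        if path.startswith(self.trap_path):
            self.flagged.add(client_ip)
        return client_ip in self.flagged

monitor = HoneypotMonitor()
print(monitor.observe("203.0.113.9", "/blog/"))   # False: ordinary page
print(monitor.observe("203.0.113.9", TRAP_PATH))  # True: followed the hidden link
print(monitor.observe("203.0.113.9", "/blog/"))   # True: stays flagged afterwards
```

What happens to a flagged client (block, tarpit, shadowban) is a separate policy decision; the monitor only records that the trap was tripped.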

Residential Proxy Networks

Some AI data collection operations route their requests through residential proxy networks — real home internet connections rented from proxy providers. This makes each request appear to come from a different residential IP address, defeating IP-based blocking entirely.

Residential proxy bots are the hardest category to detect and block. They use real browser user-agents, come from real IP addresses, and can even execute JavaScript. Behavioral detection (request patterns, honeypot activation, crawl timing) is the primary defense.

How to Detect AI Bots on Your Site

Most website analytics tools — Google Analytics, Plausible, Fathom — filter out bot traffic by design. This means AI crawlers are invisible in your analytics dashboard. You could be receiving hundreds of bot requests per day and have no idea.

There are three main methods for detecting AI bot activity:

Server Log Analysis

Your web server logs every request, including bot visits. You can search your access logs for known AI bot user-agent strings:

grep -iE "gptbot|claudebot|bytespider|perplexitybot" access.log

This works, but it requires command-line access and manual maintenance of bot signatures, and it doesn’t catch disguised crawlers.
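When you want per-bot counts rather than raw matching lines, the same search can be written in Python. The sketch below is one possible approach: the signature list is a small subset of the bots named in this article, the sample log lines are invented, and count_bot_hits is a hypothetical helper name.

```python
import re
from collections import Counter

# A few signatures from the tables in this article; extend the list as needed.
SIGNATURES = ["GPTBot", "ClaudeBot", "Bytespider", "PerplexityBot"]
PATTERN = re.compile("|".join(re.escape(s) for s in SIGNATURES), re.IGNORECASE)
CANONICAL = {s.lower(): s for s in SIGNATURES}

def count_bot_hits(log_lines):
    """Count lines per bot signature using a case-insensitive substring match."""
    hits = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            hits[CANONICAL[match.group(0).lower()]] += 1
    return hits

# Invented sample lines; in practice, pass an open access-log file object instead.
sample = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.6 - - [10/Jan/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '198.51.100.2 - - [10/Jan/2026:10:00:02 +0000] "GET /a HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]
print(count_bot_hits(sample))
```

Like the grep command, this only sees bots that announce themselves; a crawler sending a browser user-agent falls through uncounted.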

User-Agent Detection Plugins

Plugins like AI Bot Tracker maintain a database of known AI bot user-agent signatures and match incoming requests automatically. This is the simplest approach for WordPress — install, activate, and the dashboard shows you what’s crawling your site in real time.

Behavioral Detection

For bots that disguise their user-agent, honeypot-based detection catches crawlers based on behavior rather than identity. A hidden link that no human would see or click acts as a trap — anything that follows it is a bot.

Blocking Specific Bots With robots.txt

The simplest way to control AI bot access is through robots.txt. Here’s a robots.txt configuration that blocks the major AI crawlers covered above (training, search, and agent bots) while allowing search engines:

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI crawlers (training, search, and agent bots)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

Remember: robots.txt is voluntary. Compliant bots respect it, but robots.txt alone doesn’t stop all AI bots — particularly those that disguise themselves or deliberately ignore directives.
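If you maintain the block list in one place, a file in the shape shown above can be generated rather than edited by hand. The helper below is a hypothetical sketch (build_robots_txt is not part of any tool mentioned here); it emits one group per user agent, Allow groups first and Disallow groups after.

```python
def build_robots_txt(blocked, allowed=()):
    """Render one robots.txt group per user agent: Allow groups, then Disallow groups."""
    groups = [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    return "\n\n".join(groups) + "\n"

print(build_robots_txt(
    blocked=["GPTBot", "ClaudeBot", "CCBot"],
    allowed=["Googlebot", "Bingbot"],
))
```

Keeping the bot names in a list makes it easier to add new user agents as they appear, but the generated file is still subject to the same limitation: only compliant bots will honor it.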

Beyond robots.txt: Emerging Standards

Two newer web standards offer more granular control over AI access:

These complement robots.txt rather than replacing it.

How AI Bot Crawling Affects Your Site

AI bot traffic isn’t just an abstract concern — it has measurable effects on your server:

The impact scales with your content volume. A 10-page brochure site barely notices AI bots. A blog with 500 posts, a WooCommerce store with thousands of products, or a documentation site with hundreds of pages can see meaningful resource consumption from AI crawlers alone.

How AI Bot Crawl Patterns Differ

Not all AI bots crawl the same way. Understanding crawl patterns helps you identify which bots are visiting even before checking user-agent data.

Training Crawlers: Wide and Deep

LLM training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider) tend to crawl broadly. They want every public page on your site because each page is a potential training sample. These crawlers:

Search Crawlers: Targeted and Real-Time

AI search crawlers (PerplexityBot, YouBot, DuckAssistBot) fetch specific pages in response to user queries. Their patterns look different:

Agent Crawlers: Single Pages on Demand

AI agent crawlers (anthropic-ai) operate in real time — an AI assistant browses a specific page because a user asked it to. These visits are:

SEO Crawlers: Consistent and Scheduled

SEO bots (SemrushBot, AhrefsBot, DataForSeoBot) have been crawling the web since long before the AI era. They run on regular schedules and are generally well-throttled. Their crawl volume is predictable and rarely causes resource issues.

Frequently Asked Questions

How do I know if AI bots are crawling my site?

Standard analytics tools (Google Analytics, Plausible) filter out bot traffic by design. You need a dedicated detection tool like AI Bot Tracker or access to your raw server logs. AI Bot Tracker detects all bots listed on this page and provides a dashboard showing visit counts, timing, and honeypot activity.

Can I block all AI bots at once?

You can add a blanket robots.txt rule (User-agent: * / Disallow: /), but this blocks all crawlers including search engines — which would destroy your SEO. A better approach is blocking specific AI bots by name while keeping search engine access intact.

Do AI bots respect robots.txt?

Most major AI bots (GPTBot, ClaudeBot, Google-Extended, PerplexityBot) do respect robots.txt. However, roughly 13% of known AI bot user agents have been observed ignoring Disallow directives, and disguised crawlers never check robots.txt at all.

What’s the difference between GPTBot and ChatGPT-User?

GPTBot collects training data for future GPT models — it crawls your site to build OpenAI’s training corpus. ChatGPT-User browses specific pages in real time when a ChatGPT user asks it to visit a URL. Blocking GPTBot prevents training data collection; blocking ChatGPT-User prevents real-time browsing.

How do I block Google’s AI training but keep Google Search working?

Block Google-Extended (AI training for Gemini) while allowing Googlebot (search indexing). These are separate robots.txt tokens with separate directives, so you can control them independently.

What about bots that disguise themselves as browsers?

Disguised crawlers are invisible to user-agent-based detection. The most effective defense is honeypot detection — hidden links that only bots follow. When a crawler trips the honeypot, you know it’s a bot regardless of what user-agent it claims.

How much bandwidth do AI bots use?

It depends on your site’s size and which bots are crawling. A content-heavy WordPress site can see 400–750 MB of bot-only bandwidth per month from AI crawlers, with aggressive bots like Bytespider multiplying that significantly.
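To measure this on your own site rather than rely on a typical figure, you can sum the response sizes your server logged for each bot. The sketch below assumes the common "combined" access-log format (request, status, bytes, referer, user-agent); the sample lines are invented, and bytes_by_bot is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the tail of a combined-format line:
# "request" status bytes "referer" "user-agent"
TAIL = re.compile(r'"[^"]*" \d{3} (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"')

def bytes_by_bot(log_lines, signatures):
    """Sum response bytes per bot signature; lines matching no signature are skipped."""
    totals = defaultdict(int)
    for line in log_lines:
        m = TAIL.search(line)
        if m is None or m.group("size") == "-":
            continue
        agent = m.group("agent").lower()
        for name in signatures:
            if name.lower() in agent:
                totals[name] += int(m.group("size"))
                break
    return dict(totals)

# Invented sample lines for illustration only.
sample = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [10/Jan/2026:10:00:03 +0000] "GET /b HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.2 - - [10/Jan/2026:10:00:04 +0000] "GET /a HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]
usage = bytes_by_bot(sample, ["GPTBot", "Bytespider"])
print(usage)  # {'GPTBot': 3072}
```

Run over a month of logs, the totals give a per-bot bandwidth figure you can compare against the range quoted above.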

Should I block all AI bots or allow some?

This depends on your goals. Allowing PerplexityBot means your content can appear in Perplexity’s AI search results (with citations). Allowing ClaudeBot means your content may be used to train Claude, which could cite your expertise. Blocking Bytespider stops ByteDance from using your content without any clear benefit back to you. The right approach is making per-bot decisions rather than a blanket block-or-allow.

How This List Was Compiled

This list is based on user-agent signatures detected by AI Bot Tracker across thousands of WordPress installations, cross-referenced with public documentation from each AI company. New bots are added to AI Bot Tracker’s detection database as they appear in the wild.

The AI crawler landscape changes frequently. New bots appear as AI companies launch products, existing bots change their behavior, and some bots cease operations. We update this list periodically as our detection data evolves.

What Should You Do?

If your site publishes any content of value — articles, documentation, product pages, forum discussions — it’s almost certainly being crawled by multiple AI bots right now. Here’s how to approach it:

  1. Get visibility first. Install AI Bot Tracker to see which bots are visiting and how often. The free Monitor tier detects all 60+ bots listed here.
  2. Decide your policy. Not all AI crawling is bad. Some bots (PerplexityBot, Applebot-Extended) provide visibility for your content in AI products. Others (Bytespider, disguised crawlers) offer no clear benefit.
  3. Block or manage problem bots. Use robots.txt for compliant bots. For aggressive or disguised crawlers, use response strategies like tarpitting, blocking, or shadowbanning.
  4. Set up automated detection. Enable honeypot traps to catch bots that don’t identify themselves — the ones that robots.txt can’t reach.

The goal isn’t to block everything. It’s to make informed decisions about which AI systems get access to your content, and on what terms.

Try AI Bot Tracker — Free on WordPress.org

Detect, monitor, and respond to AI crawlers on your WordPress site. Full bot detection is free forever.
