As of 2026, there are over 60 distinct AI bot user agents actively crawling the web. Some belong to household names like OpenAI, Google, and Anthropic. Others are operated by less well-known companies building specialized models, AI-powered search engines, or data extraction pipelines.
This list covers every major AI crawler we’ve identified, organized by category. For each bot, we include the operator, primary purpose, known robots.txt compliance status, and behavioral notes based on real-world crawl data from AI Bot Tracker installations.
If you’re not sure which bots are hitting your site right now, install AI Bot Tracker (free) to get a real-time dashboard of all AI crawler activity.
How AI Bots Identify Themselves
Every web crawler sends a user-agent string — a text identifier included in each HTTP request header. Legitimate AI bots use distinctive user-agent strings that identify the operator and purpose. For example, OpenAI’s crawlers include “GPTBot” in their user-agent, and Anthropic’s include “ClaudeBot.”
This is how detection tools like AI Bot Tracker identify which AI company is crawling your site. The user-agent string is the first line of identification, though as we’ll cover later, not all bots identify themselves honestly.
Here’s what a typical AI bot user-agent looks like in your server logs:
Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)
The compatible; GPTBot/1.2 portion is what distinguishes this from a normal browser visit. The URL at the end points to documentation about the bot.
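As a rough illustration of how that token can be pulled out of a log line, here is a minimal Python sketch. The regular expression and the function name `extract_bot_token` are assumptions made for this example, and real-world user-agent strings vary more than this pattern allows.

```python
import re

# Matches the "compatible; <Name>/<version>" portion of a crawler user-agent.
# This pattern is an assumption for illustration; not every bot uses this layout.
BOT_TOKEN = re.compile(r"compatible;\s*([A-Za-z0-9_-]+)/([\d.]+)")


def extract_bot_token(user_agent: str):
    """Return (name, version) if the string carries a bot token, else None."""
    match = BOT_TOKEN.search(user_agent)
    if match is None:
        return None
    return match.group(1), match.group(2)


ua = "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
print(extract_bot_token(ua))  # ('GPTBot', '1.2')
```

A plain browser user-agent has no such token, so the function returns `None` for it.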
LLM Training Crawlers
These bots collect web content to build training datasets for large language models. They tend to crawl broadly and deeply — every public page on your site is a potential training sample.
| Bot | Operator | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training data for GPT models | Yes |
| ChatGPT-User | OpenAI | Real-time web browsing for ChatGPT | Yes |
| OAI-SearchBot | OpenAI | SearchGPT web index | Yes |
| ClaudeBot | Anthropic | Training data for Claude models | Yes |
| Google-Extended | Google | Training data for Gemini models | Yes |
| Bytespider | ByteDance | Training data for ByteDance AI | Partial |
| CCBot | Common Crawl | Open-source training corpus | Yes |
| Meta-ExternalAgent | Meta | Training data for Llama models | Yes |
| Amazonbot | Amazon | Training data for Alexa and Amazon AI | Yes |
| FacebookBot | Meta | Content preview and model training | Partial |
| AI2Bot | Allen Institute for AI | Open research datasets | Yes |
| Diffbot | Diffbot | Structured web data extraction | Yes |
GPTBot (OpenAI)
GPTBot is OpenAI’s primary training data crawler. It’s one of the most frequently seen AI bots across the web. OpenAI publishes a list of IP ranges for GPTBot, which makes it verifiable — you can confirm that a request claiming to be GPTBot actually originates from OpenAI’s infrastructure.
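The verification step might look something like the sketch below, which checks a request's source IP against a list of CIDR ranges. The ranges shown are reserved documentation addresses used as placeholders, not OpenAI's actual ranges; in practice you would load the list the operator publishes. The function name is also hypothetical.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges), NOT OpenAI's real ranges.
# Replace with the ranges the bot operator actually publishes.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]
NETWORKS = [ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES]


def is_from_published_range(remote_ip: str, networks=NETWORKS) -> bool:
    """Return True if remote_ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in networks)


print(is_from_published_range("192.0.2.10"))   # inside the first placeholder range
print(is_from_published_range("203.0.113.5"))  # outside both placeholder ranges
```

A request whose user-agent says GPTBot but whose IP falls outside the published ranges is a reasonable candidate for closer inspection.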
GPTBot respects robots.txt Disallow rules. Blocking it prevents your content from being used in future GPT model training, but does not retroactively remove content already collected. OpenAI also operates ChatGPT-User, a separate bot that fetches pages in real time when a ChatGPT user asks it to browse the web, and OAI-SearchBot for their SearchGPT product.
To block all three in robots.txt:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
ClaudeBot (Anthropic)
Anthropic’s training data crawler for the Claude model family. ClaudeBot is generally well-behaved — it respects robots.txt, crawls at moderate rates, and identifies itself clearly. Anthropic also operates anthropic-ai, a separate user-agent used when Claude accesses the web as an AI agent (distinct from training data collection).
Google-Extended
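Google-Extended is Google’s opt-out mechanism for Gemini model training. It is a robots.txt control token rather than a separate crawler: Google’s existing crawlers fetch the pages, and the token determines whether that content may be used for Google’s AI products. Unlike Googlebot (which you probably want to allow for search indexing), Google-Extended has no bearing on search indexing. You can block Google-Extended while keeping Googlebot active: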
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot
Allow: /
This is an important distinction — blocking Googlebot affects your search rankings, but blocking Google-Extended only affects AI training.
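To sanity-check rules like these before deploying them, one option is Python's standard-library `urllib.robotparser`. The sketch below parses the two groups from an in-memory string and asks how each token would be treated. The helper name `allowed` and the example.com URL are placeholders, and the parser's matching may differ in small ways from how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str = "https://example.com/blog/post") -> bool:
    """Ask the parsed rules whether the given agent may fetch the URL."""
    return parser.can_fetch(agent, url)


print("Google-Extended:", allowed("Google-Extended"))
print("Googlebot:", allowed("Googlebot"))
```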
Bytespider (ByteDance)
Bytespider is ByteDance’s crawler, used to collect training data for their AI products. It has a reputation for aggressive crawl behavior — high request volumes, rapid-fire page fetches, and documented cases of ignoring robots.txt Disallow rules on some sites.
Bytespider consistently appears in AI Bot Tracker’s “most active crawlers” data across WordPress installations. If you see a single bot consuming disproportionate bandwidth on your site, Bytespider is often the culprit.
CCBot (Common Crawl)
CCBot powers Common Crawl, a nonprofit that maintains an open repository of web crawl data used by researchers and AI companies worldwide. Many AI training datasets are built on top of Common Crawl data, which means blocking CCBot has a downstream effect on multiple AI models — not just one company’s products.
CCBot crawls respectfully and respects robots.txt. However, its crawl data is publicly available, which means any company can use it for training without directly crawling your site.
Meta-ExternalAgent and FacebookBot
Meta operates two crawlers relevant to AI. Meta-ExternalAgent collects training data for Llama models and Meta’s AI products. FacebookBot handles content previews for link sharing on Facebook and Instagram but has also been associated with AI data collection. Their robots.txt compliance differs: Meta-ExternalAgent generally honors Disallow rules, while FacebookBot’s compliance is only partial in practice.
AI Search Engine Crawlers
These bots power AI-enhanced search products. Unlike training crawlers that collect data in bulk for model building, search crawlers fetch specific pages to generate real-time answers for user queries.
| Bot | Operator | Purpose | Respects robots.txt |
|---|---|---|---|
| PerplexityBot | Perplexity AI | AI search and answer generation | Yes |
| YouBot | You.com | AI search engine index | Yes |
| Applebot-Extended | Apple | Apple Intelligence features | Yes |
| cohere-ai | Cohere | Enterprise AI search | Yes |
| DuckAssistBot | DuckDuckGo | AI-assisted search answers | Yes |
| PetalBot | Huawei | Petal Search AI features | Yes |
PerplexityBot
Perplexity AI’s crawler fetches pages to generate cited answers in their AI search engine. Perplexity attributes sources in its responses, which creates a value exchange — your content gets cited and linked. This makes PerplexityBot one of the AI crawlers that many site owners choose to allow.
However, Perplexity has faced criticism for sometimes providing detailed summaries that reduce the need for users to click through to the source. Whether to allow PerplexityBot depends on whether you value the citation and referral traffic, or whether you see AI-generated summaries as substitutes for your original content.
Applebot-Extended
Apple’s opt-out mechanism for Apple Intelligence features, separate from the standard Applebot used for Siri and Spotlight search. Like Google-Extended, this lets you block AI training while keeping Apple’s search features intact.
AI Agent and Automation Crawlers
These bots operate on behalf of AI agent frameworks — systems where an AI model browses the web to complete tasks for users, rather than collecting training data or building a search index.
| Bot | Operator | Purpose | Respects robots.txt |
|---|---|---|---|
| anthropic-ai | Anthropic | Claude agent web access | Yes |
| iaskspider | iAsk.Ai | AI Q&A platform | Yes |
| Webz.io | Webz.io | Web data-as-a-service for AI | Partial |
| Scrapy | Various | Open-source scraping framework | Configurable |
Agent crawlers are a growing category. As AI assistants become more capable of browsing the web autonomously, the volume of agent-driven web requests is increasing. These visits look different from training crawls — they tend to be targeted (specific pages) rather than broad (entire sites), and they happen in real time in response to user prompts.
SEO and Analytics Bots With AI Features
Traditional SEO and analytics tools have added AI-powered features that require their crawlers to collect additional data. These bots were crawling the web before the AI era, but their scope has expanded.
| Bot | Operator | Purpose | Respects robots.txt |
|---|---|---|---|
| SemrushBot | Semrush | SEO analytics and AI features | Yes |
| AhrefsBot | Ahrefs | Backlink analysis and AI tools | Yes |
| DataForSeoBot | DataForSEO | SEO data collection | Yes |
| BLEXBot | BLEXBot | Web analytics platform | Yes |
| DotBot | Moz | SEO analytics and domain authority | Yes |
| MJ12bot | Majestic | Link intelligence and analytics | Yes |
These bots are generally well-behaved and respect robots.txt. Most site owners allow them because SEO tool access is mutually beneficial — you use these tools to analyze your own site, and they need to crawl your site to provide that data.
Content and Data Crawlers
Specialized crawlers that collect content for various AI applications — image recognition, sentiment analysis, content aggregation, and niche AI products.
| Bot | Operator | Purpose | Respects robots.txt |
|---|---|---|---|
| ImagesiftBot | Imagesift | Image data for AI models | Partial |
| Kangaroo Bot | Kangaroo | AI content analysis | Unknown |
| Timpibot | Timpi | Decentralized search and AI | Yes |
| VelenPublicWebCrawler | Velen | Public web data collection | Yes |
| Omgili | Omgili | Discussion and forum content for AI | Yes |
| Seekport | Seekport | European search and AI index | Yes |
| SentiBot | SentiOne | Social listening and sentiment AI | Yes |
| Barkrowler | Babbar | Web link graph analysis | Yes |
| TurnitinBot | Turnitin | AI plagiarism detection | Yes |
These crawlers vary widely in crawl volume. Some visit infrequently, while others (particularly image-focused bots) can generate significant traffic on media-heavy sites.
Aggressive and Non-Compliant Crawlers
Not all AI bots play by the rules. Some ignore robots.txt, disguise their identity, or crawl at rates that strain server resources.
| Bot | Behavior |
|---|---|
| Bytespider | Documented ignoring Disallow on some sites; high crawl volume |
| Various unnamed | Use generic browser user-agents to evade detection |
| Residential proxy bots | Rotate through residential IPs to avoid IP-based blocks |
| Headless browser scrapers | Execute JavaScript and mimic real browsers |
Disguised Crawlers
An increasing number of AI data collection operations use standard browser user-agent strings — Chrome, Firefox, Safari — to avoid detection. These crawlers are invisible to user-agent-based blocking because they look identical to a human visitor in your server logs.
Disguised crawlers are a significant problem because traditional blocking methods don’t work against them. You can’t block them with robots.txt (they don’t identify themselves as bots), and you can’t block them by user-agent (they use the same strings as real browsers).
The most effective defense against disguised crawlers is behavioral detection. Honeypot traps catch bots that follow hidden links no human would click. AI Bot Tracker’s honeypot feature embeds invisible links in your pages — when a crawler follows them, it’s flagged and can be automatically blocked, tarpitted, or shadowbanned.
Residential Proxy Networks
Some AI data collection operations route their requests through residential proxy networks — real home internet connections rented from proxy providers. This makes each request appear to come from a different residential IP address, defeating IP-based blocking entirely.
Residential proxy bots are the hardest category to detect and block. They use real browser user-agents, come from real IP addresses, and can even execute JavaScript. Behavioral detection (request patterns, honeypot activation, crawl timing) is the primary defense.
How to Detect AI Bots on Your Site
Most website analytics tools — Google Analytics, Plausible, Fathom — filter out bot traffic by design. This means AI crawlers are invisible in your analytics dashboard. You could be receiving hundreds of bot requests per day and have no idea.
There are three main methods for detecting AI bot activity:
Server Log Analysis
Your web server logs every request, including bot visits. You can search your access logs for known AI bot user-agent strings:
grep -iE "gptbot|claudebot|bytespider|perplexitybot" access.log
This works but requires command-line access, manual maintenance of bot signatures, and doesn’t catch disguised crawlers.
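If you would rather get per-bot counts than raw matching lines, a small script can do the same search. The following is a sketch that assumes the common "combined" access log format, where the user-agent is the last double-quoted field. The signature list and sample lines are illustrative only.

```python
import re
from collections import Counter

# A short, illustrative signature list; a real one needs ongoing maintenance.
BOT_SIGNATURES = ["GPTBot", "ClaudeBot", "Bytespider", "PerplexityBot"]

# Assumes the combined log format: the user-agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(lines, signatures=BOT_SIGNATURES):
    """Count log lines whose user-agent contains a known bot signature."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if match is None:
            continue
        user_agent = match.group(1).lower()
        for signature in signatures:
            if signature.lower() in user_agent:
                counts[signature] += 1
                break
    return counts


SAMPLE_LOG = [
    '203.0.113.7 - - [10/Jan/2026:10:00:00 +0000] "GET /post-1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.7 - - [10/Jan/2026:10:00:02 +0000] "GET /post-2 HTTP/1.1" 200 4096 "-" "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.9 - - [10/Jan/2026:10:00:05 +0000] "GET /post-1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '192.0.2.44 - - [10/Jan/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"',
]

for bot, hits in count_bot_hits(SAMPLE_LOG).most_common():
    print(bot, hits)
```

To run it against a real file, pass an open file handle instead of `SAMPLE_LOG`. Like the grep approach, this only sees bots that announce themselves.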
User-Agent Detection Plugins
Plugins like AI Bot Tracker maintain a database of known AI bot user-agent signatures and match incoming requests automatically. This is the simplest approach for WordPress — install, activate, and the dashboard shows you what’s crawling your site in real time.
Behavioral Detection
For bots that disguise their user-agent, honeypot-based detection catches crawlers based on behavior rather than identity. A hidden link that no human would see or click acts as a trap — anything that follows it is a bot.
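The core of the honeypot idea fits in a few lines. The sketch below is a simplified illustration, not AI Bot Tracker's implementation: the trap path `/hp-trap-link/` and the `HoneypotMonitor` class are hypothetical, and it keys on IP address alone, which would be too coarse against crawlers that rotate addresses.

```python
# Hypothetical hidden URL: linked invisibly in pages and never shown to humans.
HONEYPOT_PATH = "/hp-trap-link/"


class HoneypotMonitor:
    """Flags any client that requests the trap path."""

    def __init__(self, trap_path: str = HONEYPOT_PATH):
        self.trap_path = trap_path
        self.flagged = set()

    def observe(self, client_ip: str, path: str) -> bool:
        """Record one request; return True if the client is flagged."""
        if path.startswith(self.trap_path):
            self.flagged.add(client_ip)
        return client_ip in self.flagged


monitor = HoneypotMonitor()
print(monitor.observe("198.51.100.4", "/blog/post-1"))    # ordinary page, not flagged
print(monitor.observe("198.51.100.4", "/hp-trap-link/"))  # followed the hidden link
print(monitor.observe("198.51.100.4", "/blog/post-2"))    # stays flagged afterwards
```

What happens after a client is flagged (blocking, slowing responses, or serving alternate content) is a separate policy decision.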
Blocking Specific Bots With robots.txt
The simplest way to control AI bot access is through robots.txt. Here’s a robots.txt configuration that blocks the major AI crawlers named in this article (training, search, and agent bots alike) while allowing search engines:
# Allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: PerplexityBot
Disallow: /
Remember: robots.txt is voluntary. Compliant bots respect it, but robots.txt alone doesn’t stop all AI bots — particularly those that disguise themselves or deliberately ignore directives.
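A block list like the one above grows as new bots appear, so it can be easier to generate it from two lists than to edit it by hand. The sketch below shows one way to do that. The function name and list contents are illustrative, and the output follows the same Allow/Disallow pattern used in this article.

```python
ALLOWED_AGENTS = ["Googlebot", "Bingbot"]
BLOCKED_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Google-Extended",
    "Bytespider", "CCBot", "Meta-ExternalAgent", "FacebookBot", "anthropic-ai",
    "Amazonbot", "PerplexityBot",
]


def build_robots_txt(allowed, blocked) -> str:
    """Render one robots.txt group per agent, separated by blank lines."""
    lines = ["# Allow search engines"]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append("# Block AI crawlers")
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines).rstrip() + "\n"


print(build_robots_txt(ALLOWED_AGENTS, BLOCKED_AGENTS))
```

Keeping the lists in version control also gives you a record of when each bot was added or removed.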
Beyond robots.txt: Emerging Standards
Two newer proposals aim to offer more granular control over AI access. Neither is a formal standard yet, and crawler support varies:
- ai.txt lets you declare per-bot policies for training, summarization, and attribution — going beyond the simple allow/deny of robots.txt.
- llms.txt provides a curated guide to your content, helping AI systems understand which pages are most important.
These complement robots.txt rather than replacing it.
How AI Bot Crawling Affects Your Site
AI bot traffic isn’t just an abstract concern — it has measurable effects on your server:
- Bandwidth consumption: Each bot request transfers your page HTML. Across dozens of bots and hundreds of pages, this adds up to significant bandwidth usage.
- Server load: Every request consumes PHP workers, database queries, and CPU time. Aggressive crawlers can compete with real visitors for server resources.
- Cache pollution: Bots that crawl deep, rarely-visited pages can push your most popular pages out of server-side caches.
- Hosting costs: If your hosting plan has bandwidth limits or charges for overages, bot traffic you didn’t consent to increases your bill.
The impact scales with your content volume. A 10-page brochure site barely notices AI bots. A blog with 500 posts, a WooCommerce store with thousands of products, or a documentation site with hundreds of pages can see meaningful resource consumption from AI crawlers alone.
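For a back-of-the-envelope estimate of the bandwidth involved, multiply request volume by average page size. The numbers in this sketch are made-up inputs for illustration, not measurements. Substitute figures from your own logs.

```python
def monthly_bot_bandwidth_mb(requests_per_day: float, avg_page_kb: float, days: int = 30) -> float:
    """Estimate monthly transfer in MB from daily request count and average page size."""
    return requests_per_day * avg_page_kb * days / 1024


# Illustrative inputs only: 300 bot requests per day at 60 KB of HTML each.
estimate = monthly_bot_bandwidth_mb(requests_per_day=300, avg_page_kb=60)
print(f"{estimate:.1f} MB per month")  # 527.3 MB per month
```

The estimate covers HTML only. Bots that also fetch images or scripts would push the figure higher.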
How AI Bot Crawl Patterns Differ
Not all AI bots crawl the same way. Understanding crawl patterns helps you identify which bots are visiting even before checking user-agent data.
Training Crawlers: Wide and Deep
LLM training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider) tend to crawl broadly. They want every public page on your site because each page is a potential training sample. These crawlers:
- Follow sitemap links and internal navigation systematically
- Crawl at all hours, often in bursts
- Request full HTML but typically skip CSS, JavaScript, and images
- Re-crawl periodically (weekly to monthly) to capture content changes
Search Crawlers: Targeted and Real-Time
AI search crawlers (PerplexityBot, YouBot, DuckAssistBot) fetch specific pages in response to user queries. Their patterns look different:
- Requests are distributed throughout the day, correlated with user search activity
- They tend to fetch specific articles or documentation pages, not entire sites
- Crawl volume is proportional to how often your content is referenced in AI search results
- They usually respect caching headers and avoid re-crawling unchanged pages
Agent Crawlers: Single Pages on Demand
AI agent crawlers (anthropic-ai) operate in real time — an AI assistant browses a specific page because a user asked it to. These visits are:
- Highly targeted (one or two pages per session)
- Unpredictable in timing (driven by end-user prompts)
- Low in total volume but growing as AI agents become mainstream
SEO Crawlers: Consistent and Scheduled
SEO bots (SemrushBot, AhrefsBot, DataForSeoBot) were crawling the web long before the AI era. They run on regular schedules and are generally well-throttled. Their crawl volume is predictable and rarely causes resource issues.
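The pattern differences described in this section can be roughly quantified from log data. The sketch below computes, per user agent, the total request count, the number of distinct paths, and the share of requests that went to static assets. The function name, the asset suffix list, and the sample data are assumptions for illustration. The numbers are only hints and do not classify a bot on their own.

```python
from collections import defaultdict

# Suffixes treated as static assets; an illustrative, incomplete list.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def crawl_profile(requests):
    """Summarize (user_agent, path) pairs into per-agent crawl statistics."""
    totals = defaultdict(int)
    assets = defaultdict(int)
    paths = defaultdict(set)
    for agent, path in requests:
        totals[agent] += 1
        paths[agent].add(path)
        # Query strings are not stripped here, so "/a.css?v=2" would not count.
        if path.lower().endswith(ASSET_SUFFIXES):
            assets[agent] += 1
    return {
        agent: {
            "requests": totals[agent],
            "distinct_paths": len(paths[agent]),
            "asset_share": assets[agent] / totals[agent],
        }
        for agent in totals
    }


SAMPLE_REQUESTS = [
    ("GPTBot", "/post-1"), ("GPTBot", "/post-2"), ("GPTBot", "/post-3"),
    ("Chrome", "/post-1"), ("Chrome", "/style.css"),
    ("Chrome", "/app.js"), ("Chrome", "/logo.png"),
]

for agent, stats in crawl_profile(SAMPLE_REQUESTS).items():
    print(agent, stats)
```

In this sample, the agent that fetches many distinct pages and no assets resembles the training-crawler pattern, while the one loading CSS, JavaScript, and images alongside a page looks more like a browser.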
Frequently Asked Questions
How do I know if AI bots are crawling my site?
Standard analytics tools (Google Analytics, Plausible) filter out bot traffic by design. You need a dedicated detection tool like AI Bot Tracker or access to your raw server logs. AI Bot Tracker detects all bots listed on this page and provides a dashboard showing visit counts, timing, and honeypot activity.
Can I block all AI bots at once?
You can add a blanket robots.txt rule (User-agent: * / Disallow: /), but this blocks all crawlers including search engines — which would destroy your SEO. A better approach is blocking specific AI bots by name while keeping search engine access intact.
Do AI bots respect robots.txt?
Most major AI bots (GPTBot, ClaudeBot, Google-Extended, PerplexityBot) do respect robots.txt. However, roughly 13% of known AI bot user agents have been observed ignoring Disallow directives, and disguised crawlers never check robots.txt at all.
What’s the difference between GPTBot and ChatGPT-User?
GPTBot collects training data for future GPT models — it crawls your site to build OpenAI’s training corpus. ChatGPT-User browses specific pages in real time when a ChatGPT user asks it to visit a URL. Blocking GPTBot prevents training data collection; blocking ChatGPT-User prevents real-time browsing.
How do I block Google’s AI training but keep Google Search working?
Block Google-Extended (AI training for Gemini) while allowing Googlebot (search indexing). These are separate robots.txt tokens with separate directives, so you can control them independently.
What about bots that disguise themselves as browsers?
Disguised crawlers are invisible to user-agent-based detection. The most effective defense is honeypot detection — hidden links that only bots follow. When a crawler trips the honeypot, you know it’s a bot regardless of what user-agent it claims.
How much bandwidth do AI bots use?
It depends on your site’s size and which bots are crawling. A content-heavy WordPress site can see 400–750 MB of bot-only bandwidth per month from AI crawlers, with aggressive bots like Bytespider multiplying that significantly.
Should I block all AI bots or allow some?
This depends on your goals. Allowing PerplexityBot means your content can appear in Perplexity’s AI search results (with citations). Allowing ClaudeBot means your content may be used to train Claude, which could cite your expertise. Blocking Bytespider stops ByteDance from using your content without any clear benefit back to you. The right approach is making per-bot decisions rather than a blanket block-or-allow.
How This List Was Compiled
This list is based on user-agent signatures detected by AI Bot Tracker across thousands of WordPress installations, cross-referenced with public documentation from each AI company. New bots are added to AI Bot Tracker’s detection database as they appear in the wild.
The AI crawler landscape changes frequently. New bots appear as AI companies launch products, existing bots change their behavior, and some bots cease operations. We update this list periodically as our detection data evolves.
What Should You Do?
If your site publishes any content of value — articles, documentation, product pages, forum discussions — it’s almost certainly being crawled by multiple AI bots right now. Here’s how to approach it:
- Get visibility first. Install AI Bot Tracker to see which bots are visiting and how often. The free Monitor tier detects all 60+ bots listed here.
- Decide your policy. Not all AI crawling is bad. Some bots (PerplexityBot, Applebot-Extended) provide visibility for your content in AI products. Others (Bytespider, disguised crawlers) offer no clear benefit.
- Block or manage problem bots. Use robots.txt for compliant bots. For aggressive or disguised crawlers, use response strategies like tarpitting, blocking, or shadowbanning.
- Set up automated detection. Enable honeypot traps to catch bots that don’t identify themselves — the ones that robots.txt can’t reach.
The goal isn’t to block everything. It’s to make informed decisions about which AI systems get access to your content, and on what terms.