May 8, 2026 · 17 min read · Crawlantix

The Complete List of AI Bots Crawling Websites in 2026

A reference list of 60+ AI bot user agents active in 2026 — who operates them, what they're used for, how they behave, and whether they respect robots.txt.

As of 2026, there are over 60 distinct AI bot user agents actively crawling the web. Some belong to household names like OpenAI, Google, and Anthropic. Others are operated by less well-known companies building specialized models, AI-powered search engines, or data extraction pipelines.

This list covers every major AI crawler we’ve identified, organized by category. For each bot, we include the operator, primary purpose, known robots.txt compliance status, and behavioral notes based on real-world crawl data from AI Bot Tracker installations.

If you’re not sure which bots are hitting your site right now, install AI Bot Tracker (free) to get a real-time dashboard of all AI crawler activity.

How AI Bots Identify Themselves

Every web crawler sends a user-agent string: a text identifier included in each HTTP request header. Legitimate AI bots use distinctive user-agent strings that identify the operator and purpose. For example, OpenAI’s training crawler includes “GPTBot” in its user-agent, and Anthropic’s includes “ClaudeBot.”

This is how detection tools like AI Bot Tracker identify which AI company is crawling your site. The user-agent string is the first line of identification, though as we’ll cover later, not all bots identify themselves honestly.

Here’s what a typical AI bot user-agent looks like in your server logs:

Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)

The compatible; GPTBot/1.2 portion is what distinguishes this from a normal browser visit. The URL at the end points to documentation about the bot.
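As a rough illustration of how a detection tool might pull the bot token out of a string like the one above, here is a minimal Python sketch. The regular expression and the `parse_bot_token` helper are assumptions made for this example, not AI Bot Tracker's actual matching logic, and real user-agent strings vary more than this pattern allows.

```python
import re

# Hypothetical pattern: matches "(compatible; Name/1.2; +https://...)".
BOT_PATTERN = re.compile(
    r"compatible;\s*(?P<name>[A-Za-z0-9._-]+)/(?P<version>[0-9.]+)"
    r"(?:;\s*\+(?P<url>[^\s)]+))?"
)

def parse_bot_token(user_agent):
    """Return (name, version, info_url), or None if no bot token is found."""
    match = BOT_PATTERN.search(user_agent)
    if match is None:
        return None
    return match.group("name"), match.group("version"), match.group("url")

ua = "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
print(parse_bot_token(ua))  # ('GPTBot', '1.2', 'https://openai.com/gptbot')
```

A plain browser user-agent has no `compatible; Name/version` token of this shape, so the helper returns `None` for it.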

LLM Training Crawlers

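These bots collect web content to build training datasets for large language models. They tend to crawl broadly and deeply, because every public page on your site is a potential training sample. Two entries in the table, ChatGPT-User and OAI-SearchBot, are OpenAI bots used for browsing and search rather than training; they are listed here alongside GPTBot.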

Bot	Operator	Purpose	Respects robots.txt
GPTBot	OpenAI	Training data for GPT models	Yes
ChatGPT-User	OpenAI	Real-time web browsing for ChatGPT	Yes
OAI-SearchBot	OpenAI	SearchGPT web index	Yes
ClaudeBot	Anthropic	Training data for Claude models	Yes
Google-Extended	Google	Training data for Gemini models	Yes
Bytespider	ByteDance	Training data for ByteDance AI	Partial
CCBot	Common Crawl	Open-source training corpus	Yes
Meta-ExternalAgent	Meta	Training data for Llama models	Yes
Amazonbot	Amazon	Training data for Alexa and Amazon AI	Yes
FacebookBot	Meta	Content preview and model training	Partial
AI2Bot	Allen Institute for AI	Open research datasets	Yes
Diffbot	Diffbot	Structured web data extraction	Yes

GPTBot (OpenAI)

GPTBot is OpenAI’s primary training data crawler. It’s one of the most frequently seen AI bots across the web. OpenAI publishes a list of IP ranges for GPTBot, which makes it verifiable — you can confirm that a request claiming to be GPTBot actually originates from OpenAI’s infrastructure.
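Checking a claimed GPTBot request against published ranges can be sketched in a few lines of Python. The CIDR blocks below are placeholder documentation ranges (RFC 5737), not OpenAI's real ranges, and `is_from_published_range` is a name made up for this example; in practice you would load whatever ranges the operator currently publishes.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT real OpenAI ranges.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_from_published_range(client_ip, ranges=PUBLISHED_RANGES):
    """Return True if client_ip falls inside any of the given networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ranges)

print(is_from_published_range("192.0.2.44"))   # True: inside a listed range
print(is_from_published_range("203.0.113.9"))  # False: outside every listed range
```

A request whose user-agent says GPTBot but whose IP fails this check is a candidate for a spoofed crawler.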

GPTBot respects robots.txt Disallow rules. Blocking it prevents your content from being used in future GPT model training, but does not retroactively remove content already collected. OpenAI also operates ChatGPT-User, a separate bot that fetches pages in real time when a ChatGPT user asks it to browse the web, and OAI-SearchBot for their SearchGPT product.

To block all three in robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

ClaudeBot (Anthropic)

Anthropic’s training data crawler for the Claude model family. ClaudeBot is generally well-behaved — it respects robots.txt, crawls at moderate rates, and identifies itself clearly. Anthropic also operates anthropic-ai, a separate user-agent used when Claude accesses the web as an AI agent (distinct from training data collection).

Google-Extended

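Google-Extended is Google’s opt-out mechanism for Gemini model training. It is a robots.txt control token rather than a separate crawler: Google fetches pages with its existing user agents, and the Google-Extended rule governs whether that content may be used for Google’s AI products. Unlike a Googlebot rule (which you probably want to leave open for search indexing), a Google-Extended rule only affects AI use. You can block Google-Extended while keeping Googlebot active: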

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

This is an important distinction — blocking Googlebot affects your search rankings, but blocking Google-Extended only affects AI training.
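One way to sanity-check rules like these before deploying them is Python's standard-library `urllib.robotparser`. This is only a sketch: the URL is hypothetical, and the parser reflects Python's own reading of the rules, which may differ in edge cases from how a given crawler interprets them.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/blog/post"  # hypothetical page on your site
print(parser.can_fetch("Google-Extended", url))  # False
print(parser.can_fetch("Googlebot", url))        # True
```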

Bytespider (ByteDance)

Bytespider is ByteDance’s crawler, used to collect training data for their AI products. It has a reputation for aggressive crawl behavior — high request volumes, rapid-fire page fetches, and documented cases of ignoring robots.txt Disallow rules on some sites.

Bytespider consistently appears in AI Bot Tracker’s “most active crawlers” data across WordPress installations. If you see a single bot consuming disproportionate bandwidth on your site, Bytespider is often the culprit.

CCBot (Common Crawl)

CCBot powers Common Crawl, a nonprofit that maintains an open repository of web crawl data used by researchers and AI companies worldwide. Many AI training datasets are built on top of Common Crawl data, which means blocking CCBot has a downstream effect on multiple AI models — not just one company’s products.

CCBot crawls respectfully and respects robots.txt. However, its crawl data is publicly available, which means any company can use it for training without directly crawling your site.

Meta-ExternalAgent and FacebookBot

Meta operates two crawlers relevant to AI. Meta-ExternalAgent collects training data for Llama models and Meta’s AI products. FacebookBot handles content previews for link sharing on Facebook and Instagram but has also been associated with AI data collection. In practice, Meta-ExternalAgent complies with robots.txt more reliably than FacebookBot, whose compliance is only partial.

AI Search Engine Crawlers

These bots power AI-enhanced search products. Unlike training crawlers that collect data in bulk for model building, search crawlers fetch specific pages to generate real-time answers for user queries.

Bot	Operator	Purpose	Respects robots.txt
PerplexityBot	Perplexity AI	AI search and answer generation	Yes
YouBot	You.com	AI search engine index	Yes
Applebot-Extended	Apple	Apple Intelligence features	Yes
cohere-ai	Cohere	Enterprise AI search	Yes
DuckAssistBot	DuckDuckGo	AI-assisted search answers	Yes
PetalBot	Huawei	Petal Search AI features	Yes

PerplexityBot

Perplexity AI’s crawler fetches pages to generate cited answers in their AI search engine. Perplexity attributes sources in its responses, which creates a value exchange — your content gets cited and linked. This makes PerplexityBot one of the AI crawlers that many site owners choose to allow.

However, Perplexity has faced criticism for sometimes providing detailed summaries that reduce the need for users to click through to the source. Whether to allow PerplexityBot depends on whether you value the citation and referral traffic, or whether you see AI-generated summaries as substitutes for your original content.

Applebot-Extended

Apple’s opt-out mechanism for Apple Intelligence features, separate from the standard Applebot used for Siri and Spotlight search. Like Google-Extended, this lets you block AI training while keeping Apple’s search features intact.

AI Agent and Automation Crawlers

These bots operate on behalf of AI agent frameworks — systems where an AI model browses the web to complete tasks for users, rather than collecting training data or building a search index.

Bot	Operator	Purpose	Respects robots.txt
anthropic-ai	Anthropic	Claude agent web access	Yes
iaskspider	iAsk.Ai	AI Q&A platform	Yes
Webz.io	Webz.io	Web data-as-a-service for AI	Partial
Scrapy	Various	Open-source scraping framework	Configurable

Agent crawlers are a growing category. As AI assistants become more capable of browsing the web autonomously, the volume of agent-driven web requests is increasing. These visits look different from training crawls — they tend to be targeted (specific pages) rather than broad (entire sites), and they happen in real time in response to user prompts.

SEO and Analytics Bots With AI Features

Traditional SEO and analytics tools have added AI-powered features that require their crawlers to collect additional data. These bots were crawling the web before the AI era, but their scope has expanded.

Bot	Operator	Purpose	Respects robots.txt
SemrushBot	Semrush	SEO analytics and AI features	Yes
AhrefsBot	Ahrefs	Backlink analysis and AI tools	Yes
DataForSeoBot	DataForSEO	SEO data collection	Yes
BLEXBot	BLEXBot	Web analytics platform	Yes
DotBot	Moz	SEO analytics and domain authority	Yes
MJ12bot	Majestic	Link intelligence and analytics	Yes

These bots are generally well-behaved and respect robots.txt. Most site owners allow them because SEO tool access is mutually beneficial — you use these tools to analyze your own site, and they need to crawl your site to provide that data.

Content and Data Crawlers

Specialized crawlers that collect content for various AI applications — image recognition, sentiment analysis, content aggregation, and niche AI products.

Bot	Operator	Purpose	Respects robots.txt
ImagesiftBot	Imagesift	Image data for AI models	Partial
Kangaroo Bot	Kangaroo	AI content analysis	Unknown
Timpibot	Timpi	Decentralized search and AI	Yes
VelenPublicWebCrawler	Velen	Public web data collection	Yes
Omgili	Omgili	Discussion and forum content for AI	Yes
Seekport	Seekport	European search and AI index	Yes
SentiBot	SentiOne	Social listening and sentiment AI	Yes
Barkrowler	Babbar	Web link graph analysis	Yes
TurnitinBot	Turnitin	AI plagiarism detection	Yes

These crawlers vary widely in crawl volume. Some visit infrequently, while others (particularly image-focused bots) can generate significant traffic on media-heavy sites.

Aggressive and Non-Compliant Crawlers

Not all AI bots play by the rules. Some ignore robots.txt, disguise their identity, or crawl at rates that strain server resources.

Bot	Behavior
Bytespider	Documented ignoring Disallow on some sites; high crawl volume
Various unnamed	Use generic browser user-agents to evade detection
Residential proxy bots	Rotate through residential IPs to avoid IP-based blocks
Headless browser scrapers	Execute JavaScript and mimic real browsers

Disguised Crawlers

An increasing number of AI data collection operations use standard browser user-agent strings — Chrome, Firefox, Safari — to avoid detection. These crawlers are invisible to user-agent-based blocking because they look identical to a human visitor in your server logs.

Disguised crawlers are a significant problem because traditional blocking methods don’t work against them. You can’t block them with robots.txt (they don’t identify themselves as bots), and you can’t block them by user-agent (they use the same strings as real browsers).

The most effective defense against disguised crawlers is behavioral detection. Honeypot traps catch bots that follow hidden links no human would click. AI Bot Tracker’s honeypot feature embeds invisible links in your pages — when a crawler follows them, it’s flagged and can be automatically blocked, tarpitted, or shadowbanned.
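The core of a honeypot check is small enough to sketch in Python. Everything here is illustrative: the trap path, the in-memory set, and the `record_request` helper are assumptions for this example and do not describe how AI Bot Tracker implements its feature. The sketch also assumes the trap path is disallowed in robots.txt so that compliant crawlers never request it.

```python
# Hypothetical trap path, linked invisibly in page markup.
TRAP_PATH = "/trap-7f3a/"

# In-memory store for the sketch; a real site would persist this.
flagged_clients = set()

def record_request(client_ip, path):
    """Flag a client once it requests the trap path.

    Returns True if the client is flagged as a bot after this request.
    """
    if path.startswith(TRAP_PATH):
        flagged_clients.add(client_ip)
    return client_ip in flagged_clients

print(record_request("203.0.113.5", "/blog/hello-world/"))  # False: normal page
print(record_request("203.0.113.5", TRAP_PATH))             # True: followed the hidden link
print(record_request("203.0.113.5", "/about/"))             # True: stays flagged
```

Once a client is flagged, the response strategy (block, tarpit, or shadowban) can be applied to all of its later requests.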

Residential Proxy Networks

Some AI data collection operations route their requests through residential proxy networks — real home internet connections rented from proxy providers. This makes each request appear to come from a different residential IP address, defeating IP-based blocking entirely.

Residential proxy bots are the hardest category to detect and block. They use real browser user-agents, come from real IP addresses, and can even execute JavaScript. Behavioral detection (request patterns, honeypot activation, crawl timing) is the primary defense.

How to Detect AI Bots on Your Site

Most website analytics tools — Google Analytics, Plausible, Fathom — filter out bot traffic by design. This means AI crawlers are invisible in your analytics dashboard. You could be receiving hundreds of bot requests per day and have no idea.

There are three main methods for detecting AI bot activity:

Server Log Analysis

Your web server logs every request, including bot visits. You can search your access logs for known AI bot user-agent strings:

grep -iE "gptbot|claudebot|bytespider|perplexitybot" access.log

This works but requires command-line access, manual maintenance of bot signatures, and doesn’t catch disguised crawlers.
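If you want per-bot counts rather than raw matching lines, the same search can be sketched in Python. The signature list mirrors the grep example, the sample log lines are made up for illustration, and `count_bot_hits` is a hypothetical helper name.

```python
import re
from collections import Counter

# Signatures taken from the grep example; extend the list as needed.
BOT_SIGNATURES = ["gptbot", "claudebot", "bytespider", "perplexitybot"]
SIGNATURE_RE = re.compile("|".join(BOT_SIGNATURES), re.IGNORECASE)

def count_bot_hits(log_lines):
    """Count log lines per bot signature, matching case-insensitively."""
    counts = Counter()
    for line in log_lines:
        match = SIGNATURE_RE.search(line)
        if match:
            counts[match.group(0).lower()] += 1
    return counts

# Made-up lines in combined log format, for illustration only.
sample = [
    '192.0.2.1 - - [08/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '192.0.2.2 - - [08/May/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    '192.0.2.1 - - [08/May/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
]
print(count_bot_hits(sample))  # Counter({'gptbot': 2})
```

To run it against a real log, pass an open file instead of the sample list, for example `with open("access.log") as f: print(count_bot_hits(f))`. Like the grep approach, this only finds bots that identify themselves.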

User-Agent Detection Plugins

Plugins like AI Bot Tracker maintain a database of known AI bot user-agent signatures and match incoming requests automatically. This is the simplest approach for WordPress — install, activate, and the dashboard shows you what’s crawling your site in real time.

Behavioral Detection

For bots that disguise their user-agent, honeypot-based detection catches crawlers based on behavior rather than identity. A hidden link that no human would see or click acts as a trap — anything that follows it is a bot.

Blocking Specific Bots With robots.txt

The simplest way to control AI bot access is through robots.txt. Here’s a robots.txt configuration that blocks the major AI crawlers covered above (training, search, and agent bots alike) while allowing search engines:

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

Remember: robots.txt is voluntary. Compliant bots respect it, but robots.txt alone doesn’t stop all AI bots — particularly those that disguise themselves or deliberately ignore directives.
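Because the list of bot names changes over time, it can be easier to generate the file than to edit it by hand. The Python sketch below rebuilds the configuration above from a list; `build_robots_txt` is a hypothetical helper, not part of any plugin or library.

```python
# Bot names copied from the configuration above.
BLOCKED_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "Google-Extended", "Bytespider", "CCBot", "Meta-ExternalAgent",
    "FacebookBot", "anthropic-ai", "Amazonbot", "PerplexityBot",
]

def build_robots_txt(blocked, allowed=("Googlebot", "Bingbot")):
    """Render one group per user agent: Allow for search engines, Disallow for the rest."""
    groups = [f"User-agent: {name}\nAllow: /" for name in allowed]
    groups += [f"User-agent: {name}\nDisallow: /" for name in blocked]
    return "\n\n".join(groups) + "\n"

print(build_robots_txt(BLOCKED_BOTS))
```

Adding or removing a bot then means changing one list entry and regenerating the file.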

Beyond robots.txt: Emerging Standards

Two newer proposals aim to offer more granular control over AI access. Neither is a formally ratified standard yet, and crawler support varies:

ai.txt lets you declare per-bot policies for training, summarization, and attribution — going beyond the simple allow/deny of robots.txt.
llms.txt provides a curated guide to your content, helping AI systems understand which pages are most important.

These complement robots.txt rather than replacing it.

How AI Bot Crawling Affects Your Site

AI bot traffic isn’t just an abstract concern — it has measurable effects on your server:

Bandwidth consumption: Each bot request transfers your page HTML. Across dozens of bots and hundreds of pages, this adds up to significant bandwidth usage.
Server load: Every request consumes PHP workers, database queries, and CPU time. Aggressive crawlers can compete with real visitors for server resources.
Cache pollution: Bots that crawl deep, rarely-visited pages can push your most popular pages out of server-side caches.
Hosting costs: If your hosting plan has bandwidth limits or charges for overages, bot traffic you didn’t consent to increases your bill.

The impact scales with your content volume. A 10-page brochure site barely notices AI bots. A blog with 500 posts, a WooCommerce store with thousands of products, or a documentation site with hundreds of pages can see meaningful resource consumption from AI crawlers alone.
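To see how the impact scales, a back-of-the-envelope estimate helps. The Python sketch below assumes the worst case, where every bot fetches every page on every crawl. All input values (page size, bot count, crawl frequency) are hypothetical placeholders; substitute figures from your own logs.

```python
def estimate_monthly_bot_bandwidth_mb(pages, avg_page_kb, bots, crawls_per_month):
    """Rough upper bound in MB: every bot fetches every page on every crawl."""
    total_kb = pages * avg_page_kb * bots * crawls_per_month
    return total_kb / 1024

# Hypothetical inputs: 60 KB of HTML per page, 10 bots, 2 full crawls a month.
print(round(estimate_monthly_bot_bandwidth_mb(500, 60, 10, 2), 1))  # 585.9 (500-post blog)
print(round(estimate_monthly_bot_bandwidth_mb(10, 60, 10, 2), 1))   # 11.7 (10-page brochure site)
```

The estimate is linear in page count, which is why the same set of bots is negligible on a brochure site and noticeable on a large blog or store.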

How AI Bot Crawl Patterns Differ

Not all AI bots crawl the same way. Understanding crawl patterns helps you identify which bots are visiting even before checking user-agent data.

Training Crawlers: Wide and Deep

LLM training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider) tend to crawl broadly. They want every public page on your site because each page is a potential training sample. These crawlers:

Follow sitemap links and internal navigation systematically
Crawl at all hours, often in bursts
Request full HTML but typically skip CSS, JavaScript, and images
Re-crawl periodically (weekly to monthly) to capture content changes

Search Crawlers: Targeted and Real-Time

AI search crawlers (PerplexityBot, YouBot, DuckAssistBot) fetch specific pages in response to user queries. Their patterns look different:

Requests are distributed throughout the day, correlated with user search activity
They tend to fetch specific articles or documentation pages, not entire sites
Crawl volume is proportional to how often your content is referenced in AI search results
They usually respect caching headers and avoid re-crawling unchanged pages

Agent Crawlers: Single Pages on Demand

AI agent crawlers (anthropic-ai) operate in real time — an AI assistant browses a specific page because a user asked it to. These visits are:

Highly targeted (one or two pages per session)
Unpredictable in timing (driven by end-user prompts)
Low in total volume but growing as AI agents become mainstream

SEO Crawlers: Consistent and Scheduled

SEO bots (SemrushBot, AhrefsBot, DataForSeoBot) were crawling the web long before the AI era. They run on regular schedules and are generally well-throttled. Their crawl volume is predictable and rarely causes resource issues.
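The patterns described in this section can be turned into a rough heuristic. The Python sketch below labels the requests from a single client that you have already attributed to a bot; the thresholds, the labels, and the `describe_crawl` helper are illustrative assumptions, not tuned values from real crawl data.

```python
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def describe_crawl(paths, broad_threshold=50):
    """Label one client's requested paths using illustrative, untuned thresholds."""
    # Distinct non-asset paths count as pages; everything else is an asset request.
    pages = {p for p in paths if not p.lower().endswith(ASSET_EXTENSIONS)}
    assets = [p for p in paths if p.lower().endswith(ASSET_EXTENSIONS)]
    if len(pages) >= broad_threshold and not assets:
        return "broad HTML-only crawl (training-style)"
    if len(pages) <= 2:
        return "targeted fetch (search- or agent-style)"
    return "unclassified"

print(describe_crawl([f"/post-{i}/" for i in range(60)]))  # broad HTML-only crawl (training-style)
print(describe_crawl(["/docs/install/"]))                  # targeted fetch (search- or agent-style)
```

A heuristic like this is a starting point for triage, not a replacement for user-agent or honeypot detection.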

Frequently Asked Questions

How do I know if AI bots are crawling my site?

Standard analytics tools (Google Analytics, Plausible) filter out bot traffic by design. You need a dedicated detection tool like AI Bot Tracker or access to your raw server logs. AI Bot Tracker detects all bots listed on this page and provides a dashboard showing visit counts, timing, and honeypot activity.

Can I block all AI bots at once?

You can add a blanket robots.txt rule (User-agent: * / Disallow: /), but this blocks all crawlers including search engines — which would destroy your SEO. A better approach is blocking specific AI bots by name while keeping search engine access intact.

Do AI bots respect robots.txt?

Most major AI bots (GPTBot, ClaudeBot, Google-Extended, PerplexityBot) do respect robots.txt. However, roughly 13% of known AI bot user agents have been observed ignoring Disallow directives, and disguised crawlers never check robots.txt at all.

What’s the difference between GPTBot and ChatGPT-User?

GPTBot collects training data for future GPT models — it crawls your site to build OpenAI’s training corpus. ChatGPT-User browses specific pages in real time when a ChatGPT user asks it to visit a URL. Blocking GPTBot prevents training data collection; blocking ChatGPT-User prevents real-time browsing.

How do I block Google’s AI training but keep Google Search working?

Block Google-Extended (AI training for Gemini) while allowing Googlebot (search indexing). These are separate robots.txt tokens, so you can control them independently.

What about bots that disguise themselves as browsers?

Disguised crawlers are invisible to user-agent-based detection. The most effective defense is honeypot detection — hidden links that only bots follow. When a crawler trips the honeypot, you know it’s a bot regardless of what user-agent it claims.

How much bandwidth do AI bots use?

It depends on your site’s size and which bots are crawling. A content-heavy WordPress site can see 400–750 MB of bot-only bandwidth per month from AI crawlers, with aggressive bots like Bytespider multiplying that significantly.

Should I block all AI bots or allow some?

This depends on your goals. Allowing PerplexityBot means your content can appear in Perplexity’s AI search results (with citations). Allowing ClaudeBot means your content may be used to train Claude, which could cite your expertise. Blocking Bytespider stops ByteDance from using your content without any clear benefit back to you. The right approach is making per-bot decisions rather than a blanket block-or-allow.

How This List Was Compiled

This list is based on user-agent signatures detected by AI Bot Tracker across thousands of WordPress installations, cross-referenced with public documentation from each AI company. New bots are added to AI Bot Tracker’s detection database as they appear in the wild.

The AI crawler landscape changes frequently. New bots appear as AI companies launch products, existing bots change their behavior, and some bots cease operations. We update this list periodically as our detection data evolves.

What Should You Do?

If your site publishes any content of value — articles, documentation, product pages, forum discussions — it’s almost certainly being crawled by multiple AI bots right now. Here’s how to approach it:

Get visibility first. Install AI Bot Tracker to see which bots are visiting and how often. The free Monitor tier detects all 60+ bots listed here.
Decide your policy. Not all AI crawling is bad. Some bots (PerplexityBot, Applebot-Extended) provide visibility for your content in AI products. Others (Bytespider, disguised crawlers) offer no clear benefit.
Block or manage problem bots. Use robots.txt for compliant bots. For aggressive or disguised crawlers, use response strategies like tarpitting, blocking, or shadowbanning.
Set up automated detection. Enable honeypot traps to catch bots that don’t identify themselves — the ones that robots.txt can’t reach.

The goal isn’t to block everything. It’s to make informed decisions about which AI systems get access to your content, and on what terms.

Try AI Bot Tracker — Free on WordPress.org

Detect, monitor, and respond to AI crawlers on your WordPress site. Full bot detection is free forever.
