Open Training Scraper

Common Crawl (CCBot) Access Checker

Common Crawl is an open repository of web crawl data, widely used by researchers and SaaS startups to train custom AI models.

Verify Robots.txt blocks for Common Crawl (CCBot)

Enter your domain to run a live check of the robots.txt user-agent directives that apply to CCBot.

Official User-Agent String

CCBot/2.0 (https://commoncrawl.org/faq/)
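To spot this crawler in your own server logs, you can match on the CCBot product token rather than the full string. The sketch below is only an illustration: the helper name is_ccbot is hypothetical, and because any client can send any User-Agent header, a match is a hint and not proof that the request came from Common Crawl.

```python
def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the CCBot token.

    Hypothetical helper for log filtering. The comparison is
    case-insensitive and ignores the version and the URL in parentheses,
    so it keeps working if those parts of the string change.
    """
    return "ccbot" in user_agent.lower()


if __name__ == "__main__":
    # A sample header value and an ordinary browser for comparison.
    print(is_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)"))
    print(is_ccbot("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```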

Verification Directives

To Block Crawler (Disallow)
User-agent: CCBot
Disallow: /
To Allow Crawler (Allow)
User-agent: CCBot
Allow: /
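
The directives above can be checked offline with Python's standard urllib.robotparser module. This is a minimal sketch, not the checker this page runs: the function name ccbot_allowed and the example.com URLs are assumptions made for illustration, and Python's parser applies the first matching rule in a group, which can differ from how a given crawler resolves overlapping Allow and Disallow lines.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`.

    Hypothetical helper: parses robots.txt content that is already in
    memory, so no network request is made.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


# The two directive sets shown above.
BLOCK_RULES = "User-agent: CCBot\nDisallow: /\n"
ALLOW_RULES = "User-agent: CCBot\nAllow: /\n"

if __name__ == "__main__":
    print("block rules allow CCBot:", ccbot_allowed(BLOCK_RULES))
    print("allow rules allow CCBot:", ccbot_allowed(ALLOW_RULES))
```

To audit a live site instead, RobotFileParser also offers set_url() and read(), which download the robots.txt file before can_fetch() is called.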

Common Crawl (CCBot) Search Optimization FAQs

Q: What is CCBot / Common Crawl?

CCBot is the crawler for Common Crawl, a non-profit organization that maintains a massive open archive of web data.

Q: Should I block CCBot?

Many open-source LLMs are trained on Common Crawl data. If you want to keep your content out of open AI training datasets built from it, block CCBot in your robots.txt file. The block applies only to future crawls; it does not remove pages that are already in the archive.