Open Training Scraper
Common Crawl (CCBot) Access Checker
Common Crawl is an open repository of web crawl data, widely used by researchers and SaaS startups to train custom AI models.
Verify Robots.txt blocks for Common Crawl (CCBot)
Enter your domain to run a live check of the user-agent directives in its robots.txt that apply to CCBot.
Official User-Agent String
CCBot/2.0 (compatible; +http://commoncrawl.org/faq/)
Verification Directives
To Block Crawler (Disallow)
User-agent: CCBot
Disallow: /
To Allow Crawler (Allow)
User-agent: CCBot
Allow: /
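As a rough illustration of how directives like these are evaluated, the sketch below uses Python's standard-library urllib.robotparser to test whether a given robots.txt body lets CCBot fetch a URL. The helper name ccbot_allowed, the constants BLOCK_RULES and ALLOW_RULES, and the example.com URL are hypothetical and chosen only for this example; this is not necessarily how the checker on this page works.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the robots.txt text permits the CCBot user agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


# Hypothetical sample rule sets mirroring the two directive blocks above.
BLOCK_RULES = "User-agent: CCBot\nDisallow: /\n"
ALLOW_RULES = "User-agent: CCBot\nAllow: /\n"

if __name__ == "__main__":
    print("Block rules allow CCBot:", ccbot_allowed(BLOCK_RULES))
    print("Allow rules allow CCBot:", ccbot_allowed(ALLOW_RULES))
```

To check a live site instead of a string, the same parser can be pointed at a robots.txt URL with set_url() followed by read(), which performs the HTTP request. Rules addressed only to CCBot do not affect other user agents, so a separate group is needed for each crawler you want to control.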
Common Crawl (CCBot) Search Optimization FAQs
Q: What is CCBot / Common Crawl?
CCBot is the crawler for Common Crawl, a non-profit organization that maintains a massive open archive of web data.
Q: Should I block CCBot?
Many open-source LLMs are trained on Common Crawl data. If you want to keep your content out of open AI training datasets, blocking CCBot is recommended. A robots.txt block applies only to future crawls; it does not remove pages already captured in earlier Common Crawl archives.