Getting Started with Cloudflare's new AI Crawl Control

Cloudflare has recently announced a new feature called AI Crawl Control.
What is AI Crawl Control From Cloudflare?

Cloudflare's AI Crawl Control is a tool that gives website owners granular control over how AI services access their content. It evolved from Cloudflare's AI Audit tool to empower content creators with more detailed insights and agency over which AI crawlers are permitted to access their data, as well as for what purpose. 


The solution is designed to address the challenges faced by content creators, who often see their content scraped by AI models without compensation or referral traffic back to their sites.  

 

We have previously discussed in detail how to use robots.txt for websites. As a website owner, you can specify rules in robots.txt that tell a web crawler such as Googlebot which parts of your site it may crawl (or index). The numbered breakdown below explains each directive in the sample file, and a short Python sketch after the list shows how to test these rules.

Consider the robots.txt file available here -> https://www.sundeepmachado.com/robots.txt 

Sample robots.txt file
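
For reference, the file itself looks like this (reconstructed from the directives explained below; check the live URL above for the current version):

```
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
```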

  1. User-agent: Mediapartners-Google: This line targets a specific Google bot. Mediapartners-Google is the user agent for Google's AdSense service; it crawls your site's content to serve relevant ads.
  2. Disallow:: The Disallow: directive is empty, which means nothing is disallowed for the AdSense bot. The owner of this website is explicitly giving the AdSense crawler full permission to access the entire site.
  3. User-agent: *: The asterisk * is a wildcard, meaning this rule applies to all other web crawlers (Google's main search bot, Bingbot, etc.) that don't have a more specific rule set for them. This also includes AI crawlers such as those from OpenAI.
  4. Disallow: /search: This tells all those crawlers not to crawl any URL whose path begins with /search. This is a common and recommended practice: internal site-search result pages are often considered "thin" or duplicate content by search engines, so blocking them from being indexed is good for SEO.
  5. Allow: /: This directive explicitly permits crawlers to access the root (/) and all other pages and subdirectories of the site not covered by a Disallow rule. It reinforces that the rest of the site is open for crawling.
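
If you want to verify how these rules are interpreted, Python's standard-library urllib.robotparser can evaluate a robots.txt against a given user agent. A minimal sketch (the user agents and URLs are just the examples from this post):

```python
from urllib import robotparser

# Point the parser at the live robots.txt discussed above.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.sundeepmachado.com/robots.txt")
rp.read()  # fetches and parses the file

# The AdSense bot has an empty Disallow, so everything is allowed for it.
print(rp.can_fetch("Mediapartners-Google",
                   "https://www.sundeepmachado.com/search?q=ai"))  # True

# All other crawlers are blocked from /search but allowed elsewhere.
print(rp.can_fetch("*", "https://www.sundeepmachado.com/search?q=ai"))    # False
print(rp.can_fetch("GPTBot", "https://www.sundeepmachado.com/some-post")) # True
```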

 

Cloudflare AI Crawler Dashboard

  1. As you can see above, you can allow or block a particular bot.
  2. The archive_org bot is violating the robots.txt file by accessing the /search URL pattern, which is disallowed by the fourth directive discussed above.

 

Premium Feature: Pay Per Crawl

Cloudflare's Pay Per Crawl is a system that allows website owners to charge AI companies for crawling their online content. 

 Using this feature, a site owner can choose to freely allow, completely block, or charge a specific price per request for known AI bots. The system uses the 402 Payment Required HTTP status code to signal the need for payment, with Cloudflare handling the entire transaction process.  
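
To make the flow concrete, here is a rough sketch of how a crawler might detect that payment signal. This is not Cloudflare's official client, and the crawler-price response header used below is an illustrative assumption, not a confirmed part of the API:

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical crawl target sitting behind Cloudflare's Pay Per Crawl.
url = "https://example.com/some-article"

resp = requests.get(url, headers={"User-Agent": "ExampleAIBot/1.0"})

if resp.status_code == 402:
    # 402 Payment Required: the site is asking to be paid for this crawl.
    # "crawler-price" is an illustrative header name, not a confirmed API.
    price = resp.headers.get("crawler-price", "<not advertised>")
    print(f"Payment required to crawl {url}; quoted price: {price}")
elif resp.ok:
    print(f"Fetched {len(resp.content)} bytes from {url}")
else:
    print(f"Request failed with status {resp.status_code}")
```

Since Cloudflare handles the transaction itself, the site owner only configures a price in the dashboard; a paying crawler would then retry the request with whatever payment headers Cloudflare specifies.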



 

 

 



