Getting Started with Cloudflare's new AI Crawl Control

Cloudflare has recently announced a new feature called AI Crawl Control.
What is AI Crawl Control From Cloudflare?

Cloudflare's AI Crawl Control is a tool that gives website owners granular control over how AI services access their content. It evolved from Cloudflare's AI Audit tool to empower content creators with more detailed insights and agency over which AI crawlers are permitted to access their data, as well as for what purpose. 


The solution is designed to address the challenges faced by content creators, who often see their content scraped by AI models without compensation or referral traffic back to their sites.  

 

We have previously discussed in detail how to use robots.txt for websites. As a website owner, you can specify rules in robots.txt that tell a web crawler such as Googlebot which parts of your site it may crawl (or index). The numbered breakdown below explains each directive in the sample file, and a short Python sketch after the list shows how to test these rules.

Consider the robots.txt file available here -> https://www.sundeepmachado.com/robots.txt 

Sample robots.txt file
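
For reference, the file itself looks like this (reconstructed from the directives explained below; check the live URL above for the current version):

```
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
```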

  1. User-agent: Mediapartners-Google: This line targets a specific Google bot. Mediapartners-Google is the user agent for Google's AdSense service; it crawls your site's content to serve relevant ads.
  2. Disallow:: The Disallow: directive is empty, which means nothing is disallowed for the AdSense bot. The owner of this website is explicitly giving the AdSense crawler full permission to access the entire site.
  3. User-agent: *: The asterisk * is a wildcard, meaning this rule applies to all other web crawlers (Google's main search bot, Bingbot, etc.) that don't have a more specific rule set for them. This also includes AI crawlers such as those from OpenAI.
  4. Disallow: /search: This tells all those crawlers not to crawl any URL whose path begins with /search. This is a common and recommended practice: internal site-search result pages are often considered "thin" or duplicate content by search engines, so blocking them from being indexed is good for SEO.
  5. Allow: /: This directive explicitly permits crawlers to access the root (/) and all other pages and subdirectories of the site not covered by a Disallow rule. It reinforces that the rest of the site is open for crawling.
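
If you want to verify how these rules are interpreted, Python's standard-library urllib.robotparser can evaluate a robots.txt against a given user agent. A minimal sketch (the user agents and URLs are just the examples from this post):

```python
from urllib import robotparser

# Point the parser at the live robots.txt discussed above.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.sundeepmachado.com/robots.txt")
rp.read()  # fetches and parses the file

# The AdSense bot has an empty Disallow, so everything is allowed for it.
print(rp.can_fetch("Mediapartners-Google",
                   "https://www.sundeepmachado.com/search?q=ai"))  # True

# All other crawlers are blocked from /search but allowed elsewhere.
print(rp.can_fetch("*", "https://www.sundeepmachado.com/search?q=ai"))    # False
print(rp.can_fetch("GPTBot", "https://www.sundeepmachado.com/some-post")) # True
```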

 

Cloudflare AI Crawler Dashboard

  1. As you can see above, you can allow or block a particular bot.
  2. The archive_org bot is violating the robots.txt file by accessing the /search URL pattern, which is disallowed by the fourth directive discussed above.

 

Premium Feature: Pay Per Crawl

Cloudflare's Pay Per Crawl is a system that allows website owners to charge AI companies for crawling their online content. 

 Using this feature, a site owner can choose to freely allow, completely block, or charge a specific price per request for known AI bots. The system uses the 402 Payment Required HTTP status code to signal the need for payment, with Cloudflare handling the entire transaction process.  
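
To make the flow concrete, here is a rough sketch of how a crawler might detect that payment signal. This is not Cloudflare's official client, and the crawler-price response header used below is an illustrative assumption, not a confirmed part of the API:

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical crawl target sitting behind Cloudflare's Pay Per Crawl.
url = "https://example.com/some-article"

resp = requests.get(url, headers={"User-Agent": "ExampleAIBot/1.0"})

if resp.status_code == 402:
    # 402 Payment Required: the site is asking to be paid for this crawl.
    # "crawler-price" is an illustrative header name, not a confirmed API.
    price = resp.headers.get("crawler-price", "<not advertised>")
    print(f"Payment required to crawl {url}; quoted price: {price}")
elif resp.ok:
    print(f"Fetched {len(resp.content)} bytes from {url}")
else:
    print(f"Request failed with status {resp.status_code}")
```

Since Cloudflare handles the transaction itself, the site owner only configures a price in the dashboard; a paying crawler would then retry the request with whatever payment headers Cloudflare specifies.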



 

 

 



