
The Great AI Crawler Chaos

At Glow we’ve noticed that, with the increase in artificial intelligence tools, website owners are facing a growing challenge: how do you differentiate between legitimate AI crawlers (which might enhance the visibility of your website or help inform tools such as ChatGPT) and bad bots that consume resources and hamper performance? Bad bots may be intentionally malevolent or just badly configured. It is no longer just Googlebot scanning your site; in this article we discuss our experience of a site being targeted by tens of bots.

 

This article dives into the technical strategies that your website developer might use to help you manage this chaos. We'll look at configuring robots.txt, rate limiting, IP blocking, blocking requests by header in IIS, and setting up Cloudflare.

 

Crawlers

Crawlers (bots) are automated services designed to traverse your website for various purposes. Googlebot and Bingbot are two legitimate “legacy” examples used for indexing, which you might be familiar with from carrying out online searches. AI tools such as OpenAI's also use crawlers; these are newer legitimate bots used in AI training. Not all crawlers are well intentioned or well configured, so distinguishing between legitimate and bad is the key challenge.

Let’s discuss some steps you can take to tackle this issue.

 

Step 1: Configure robots.txt

The robots.txt file is a text file placed in your website’s root directory to provide instructions for bots. While it’s respected by most legitimate bots, we recently noticed extremely high traffic on one of our ecommerce sites; it seemed that malicious crawlers (and some immature ones) were simply ignoring the robots.txt file.

You may be familiar with the basic text of a robots.txt file; here’s an example:

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /private/

  • User-agent: * applies to all crawlers.

  • Disallow: /private/ blocks access to sensitive directories.

  • Allow: / grants specific bots full access.

On the ecommerce site we were working on, we started with a review of the IIS traffic logs and robots.txt. Doing this, we were able to:

  1. Identify the legitimate and the bad bots: See Bot Verification for how to do this with Google crawlers, and Overview of OpenAI Crawlers - OpenAI API for a list of the IPs ChatGPT crawlers use. You may notice some bots mimicking those crawlers too.

  2. Disallow Sensitive Areas: We restricted access to non-public areas; your site might disallow /administration/, for example.

  3. Introduce Crawl Delay: For some legitimate but overactive bots it is possible to introduce a crawl delay (a sketch is shown after this list); see Crawl delay and the Bing crawler, MSNBot | Bing Webmaster Blog.

  4. Test with Google Search Console: Validate your robots.txt file for errors to avoid accidental indexing issues; see robots.txt report - Search Console Help.
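To illustrate points 2 and 3, the additions might look something like the sketch below. The path and the delay value are placeholders for your own site; note that Googlebot ignores Crawl-delay, Bingbot honours it, and AI crawlers vary in whether they respect it at all.

# Keep all crawlers out of non-public areas
User-agent: *
Disallow: /administration/

# Ask an overactive but legitimate crawler to slow down
# (Crawl-delay is commonly read as seconds between requests)
User-agent: bingbot
Crawl-delay: 10

OpenAI's crawlers such as GPTBot identify themselves in the User-agent string, so they can be given their own sections in the same way if you want to restrict them.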

Of course, robots.txt cannot prevent bots from simply ignoring your rules (and many legitimate bots ignore Crawl-delay altogether). You’ll need to carry out further steps for those.

 

Step 2: Implement Rate Limiting

Since Crawl-delay was not universally respected, we looked at introducing rate limiting controls. These limit the number of requests a single IP address can make to your website in a given time period.

There are many techniques for rate limiting; two common options are Cloudflare’s rate limiting rules and IIS’s Dynamic IP Restrictions module.

Rate limiting can stop your site from being overwhelmed and can help mitigate the impact of Distributed Denial of Service (DDoS) attacks.

We had to be careful introducing these rules, as badly configured rate limits can block legitimate traffic, especially if your APIs or services depend on high request rates. In ecommerce, blocking potential customers is essential to avoid.
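As a minimal sketch, here is roughly what the IIS option might look like in web.config. The thresholds shown are purely illustrative and would need tuning against your own traffic, and the Dynamic IP Restrictions module must be installed for IIS to recognise this section.

<system.webServer>
  <security>
    <!-- Dynamic IP Restrictions: deny an IP that exceeds these thresholds -->
    <dynamicIpSecurity denyAction="Forbidden">
      <!-- More than 50 requests within any 5-second window -->
      <denyByRequestRate enabled="true" maxRequests="50" requestIntervalInMilliseconds="5000" />
      <!-- More than 20 simultaneous connections from one IP -->
      <denyByConcurrentRequests enabled="true" maxConcurrentRequests="20" />
    </dynamicIpSecurity>
  </security>
</system.webServer>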

 

Step 3: Use IP Blocking Wisely

Blocking IPs can be an effective way to stop some spam traffic. However, it's important to use this method wisely to avoid blocking legitimate website users. The technique is limited since crawlers can be directed from different IPs.

There are various methods of IP blocking:

  1. Windows Firewall: Block an IP Address on a Windows Server

  2. IIS IP and Domain Restrictions: See Block an IP address accessing the application

  3. Cloudflare IP Access Rules: See Overview | Cloudflare Web Application Firewall to get started.

Advanced bots often use rotating proxies or VPNs to evade static IP blocks. Be careful using country blocks in Cloudflare too: you might block a lot of legitimate traffic.
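As a rough sketch of the IIS option, the IP and Domain Restrictions feature is configured in web.config like this. The addresses below are from the reserved documentation ranges and are placeholders for the offending IPs you identify in your logs.

<system.webServer>
  <security>
    <!-- Deny the listed addresses, allow everything else -->
    <ipSecurity allowUnlisted="true">
      <add ipAddress="203.0.113.45" allowed="false" />
      <!-- A whole subnet can be denied with a mask -->
      <add ipAddress="198.51.100.0" subnetMask="255.255.255.0" allowed="false" />
    </ipSecurity>
  </security>
</system.webServer>

Note that if your site sits behind Cloudflare, IIS will see Cloudflare’s IP addresses rather than the visitor’s unless you restore the original client IP, so blocks like this are often better applied as Cloudflare IP Access Rules instead.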

 

Step 4: Block Requests by Header in IIS

IIS allows you to configure rules based on request headers. Some malicious bots send identifying patterns in their headers (often in the User-Agent string); if you can identify these, you can potentially block them.

Configuring Blocking Rules:

  1. Open IIS Manager.

  2. Navigate to the site’s Request Filtering section.

  3. Add blocking rules for specific headers.
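A minimal sketch of the resulting rule in web.config is shown below. The bot names in denyStrings are placeholders; substitute the patterns you actually see in your logs, and test carefully given the mimicry described next. Matching requests are rejected by Request Filtering before they reach the application.

<system.webServer>
  <security>
    <requestFiltering>
      <filteringRules>
        <!-- Scan the User-Agent header and reject requests containing these substrings -->
        <filteringRule name="BlockBadBots" scanUrl="false" scanQueryString="false">
          <scanHeaders>
            <add requestHeader="User-Agent" />
          </scanHeaders>
          <denyStrings>
            <add string="ExampleBadBot" />
            <add string="AnotherScraper" />
          </denyStrings>
        </filteringRule>
      </filteringRules>
    </requestFiltering>
  </security>
</system.webServer>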

We found that malicious bots were mimicking the User-Agent strings of legitimate bots, making it challenging to block accurately without risking false positives. In addition, malicious bots can change their User-Agent string at any time.

 

Step 5: Further Cloudflare Configuration

Cloudflare acts as a web application firewall (WAF) in front of your site. We’ve already mentioned rate limiting in Cloudflare, but you might like to read about these other features that we’ve needed to become familiar with to get further control of the bots:

  1. Under Attack Mode: Under Attack mode | Cloudflare Fundamentals docs

  2. Challenging bots: Challenge bad bots | Cloudflare Web Application Firewall (WAF) docs

  3. Scoring bots: Bot scores | Cloudflare bot solutions docs
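Combining points 2 and 3, a WAF custom rule can, for example, issue a Managed Challenge to requests that Cloudflare scores as likely automated while leaving its verified bots alone. A sketch of such a rule is below; bot scores require Cloudflare’s Bot Management add-on, and the threshold of 30 is illustrative only.

Expression: (cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)
Action:     Managed Challenge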

Cloudflare has also recently introduced a feature that lets you analyse the AI bot traffic on your site; you can read more about it in Cloudflare’s announcement.

 

Summing up…

Distinguishing between legitimate and badly behaved crawlers requires a combination of strategies: robots.txt configuration, rate limiting, IP blocking, and Cloudflare rules. No approach on its own is foolproof; working with your website developer to layer these methods helps to ensure a robust defence against bad bots while allowing legitimate traffic to reach your website. If you have contact forms, you might be interested in how recaptchas and honeytraps work too.

Are you affected by or looking to mitigate any of the issues discussed? Get in touch - we can help.