Benchmarking bot detection systems against modern AI agents

We evaluate five leading invisible CAPTCHAs and find significant differences in their ability to detect modern bots and AI agents.

CAPTCHAs and other bot detection methods were originally designed to stop automated scripts that operate very differently from human users. It's unclear how well these detection systems perform against modern AI agents that operate real browsers and interact with web applications in more human-like ways.

To investigate, we designed a comprehensive benchmark to test bot detection systems against a diverse range of automated agents. We evaluate five leading detection systems—Google reCAPTCHA v3, hCaptcha, FingerprintJS Pro, Cloudflare Turnstile, and Roundtable Proof of Human—and find significant differences in performance. Detection rates range from 33% to 87%, with systems that rely on both behavior and device signals significantly outperforming device-only approaches.

Background

Here, we introduce a new benchmark for evaluating bot detection systems based on their ability to identify different types of automated agents. This involves (1) creating a set of real-world web interaction tasks, (2) programming bots to complete these tasks, and (3) measuring the performance of different systems in identifying these bots.

This benchmark focuses specifically on invisible bot detection methods that don't increase friction for humans through challenges or user interaction requirements. Modern CAPTCHAs and other challenge-based systems can often filter bots but create significant user experience friction. In theory, one could achieve near-perfect bot detection accuracy by adding enough friction (multiple complex CAPTCHAs, multi-step verification, etc.). However, this approach is impractical for real-world applications, as it leads to high false positive rates and user frustration.

Methodology

We constructed a new benchmark to evaluate the performance of bot detection systems. The benchmark is based on five distinct web tasks:

Task | Description | Relevant Behaviors
Sign up form | Registration form with name, email, and password fields | Typing patterns, field navigation, form completion speed
Online survey | Multi-question survey with various input types | Response timing, selection patterns, scrolling behavior
Review form | Product review with star rating and text feedback | Click precision, text entry patterns, rating interaction
Article unlock | Email capture form to access premium content | Mouse movements, typing rhythm, submission timing
Psychological experiment | Cognitive task requiring attention and decision-making | Reaction times, attention patterns, cognitive load indicators
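
For concreteness, the mapping between tasks and the behavioral signals they are designed to elicit can be thought of as a small configuration structure like the one below. This is an illustrative sketch of how the benchmark is organized, not its actual code; the keys simply mirror the table above.

```python
# Illustrative mapping from benchmark task to the behavioral signals it elicits.
# This mirrors the task table above; it is not the benchmark's actual code.
TASK_BEHAVIORS = {
    "sign_up_form": ["typing patterns", "field navigation", "form completion speed"],
    "online_survey": ["response timing", "selection patterns", "scrolling behavior"],
    "review_form": ["click precision", "text entry patterns", "rating interaction"],
    "article_unlock": ["mouse movements", "typing rhythm", "submission timing"],
    "psych_experiment": ["reaction times", "attention patterns", "cognitive load indicators"],
}
```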

We used three different types of bots in our evaluation, each in equal proportion: traditional bots, browser automation tools, and AI agents. Traditional bots were programmed as simple scripts that hard-coded the task logic ahead of time (i.e., clicking specific buttons or filling out forms with predetermined behaviors at specific time intervals). Browser automation tools take control of a human's real browser environment, and can be intelligent (using AI to understand the task) or non-intelligent (using hard-coded behaviors). AI agents are advanced bots that use LLMs to understand and complete tasks with minimal human oversight.
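
To make these categories concrete, the sketch below shows what a "traditional bot" looks like in practice: a short Playwright script that fills out a sign-up form with hard-coded values at fixed intervals. The URL, selectors, and delays are hypothetical and are not taken from our benchmark tasks.

```python
# A minimal "traditional bot": hard-coded task logic with fixed timing.
# The URL, selectors, and delays are hypothetical; real tasks differ.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/signup")  # hypothetical task URL

    # Predetermined behavior at fixed intervals: no variability,
    # no corrections, and no natural mouse movement.
    page.fill("#name", "Jane Doe")
    page.wait_for_timeout(500)
    page.fill("#email", "jane@example.com")
    page.wait_for_timeout(500)
    page.fill("#password", "hunter2!")
    page.wait_for_timeout(500)
    page.click("#submit")

    browser.close()
```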

Example review form
Figure 1: Review form task used in the benchmark. This task requires entering a name, clicking a star rating, and entering a review of at least 50 characters. At submission time, the bot detection system is triggered to evaluate whether the interaction is human or bot.

We compared five different bot detection systems: Roundtable Proof of Human, Google reCAPTCHA v3, hCaptcha 99.9% Passive Mode, FingerprintJS Pro, and Cloudflare Turnstile. For each system, we defined criteria for classifying a session as a bot based on the system's output (the criteria for hCaptcha's Passive Mode are described in the note at the end of this post).
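
As one example of how these verdicts are produced, Google reCAPTCHA v3 issues a token on the client that the site owner verifies server-side, receiving a score between 0.0 (likely a bot) and 1.0 (likely a human). The sketch below shows a typical verification call; the secret key is a placeholder and the 0.5 cutoff is a common default, not necessarily the criterion used in this benchmark.

```python
# Server-side verification of a reCAPTCHA v3 token (illustrative sketch).
# RECAPTCHA_SECRET is a placeholder; the 0.5 cutoff is a common default,
# not necessarily the threshold used in our evaluation.
import requests

RECAPTCHA_SECRET = "your-secret-key"

def is_likely_bot(client_token: str, threshold: float = 0.5) -> bool:
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": client_token},
        timeout=5,
    )
    result = resp.json()
    # v3 responses include a 0.0-1.0 score; lower means more bot-like.
    if not result.get("success", False):
        return True
    return result.get("score", 0.0) < threshold
```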

We evaluated each bot detection system 10 times on each task/bot combination and measured performance in terms of percentage of bots identified. This gave us a final sample size of 150 sessions (5 tasks x 3 bot types x 10 trials) for each detection system (n=750 total).
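
Because each system's score is a proportion over 150 binary outcomes, it is straightforward to attach a rough confidence interval to it. The sketch below computes a detection rate with a normal-approximation 95% interval; the counts are placeholders, not our measured values, and this is not the benchmark's analysis code.

```python
# Detection rate with a simple 95% confidence interval (normal approximation).
# The counts below are placeholders, not the benchmark's measured values.
import math

def detection_rate_ci(detected: int, total: int, z: float = 1.96):
    p = detected / total
    se = math.sqrt(p * (1 - p) / total)
    return p, (max(0.0, p - z * se), min(1.0, p + z * se))

rate, (low, high) = detection_rate_ci(detected=120, total=150)
print(f"detection rate = {rate:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```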

ChatGPT agent
Figure 2: ChatGPT agent filling out the sign-up form. The ChatGPT agent triggers both behavioral and device flags in Roundtable's system but bypasses Google reCAPTCHA v3.

Bot detection performance was determined as the overall proportion of bots identified across all tasks and bot types. The final results are shown in Figure 3.

Bot benchmarking results
Figure 3: Bot benchmarking results. Each bar shows the percentage of bots identified by the given detection system across all tasks and bot types. Each system was evaluated on 150 bot sessions.

Roundtable Proof of Human achieved the highest bot detection score, successfully identifying over 86% of bots across all tasks and bot types. Google reCAPTCHA v3 and hCaptcha scored next best, while FingerprintJS Pro and Cloudflare Turnstile performed worst.

Why behavioral systems do better

The performance gap between behavioral and device-only systems reflects fundamental differences in detection approaches. Google reCAPTCHA v3, hCaptcha, and Roundtable Proof of Human analyze both device and user behavior patterns. By contrast, FingerprintJS Pro and Cloudflare Turnstile rely primarily on device fingerprinting and network characteristics.

Keystroke dynamics
Figure 4: Human and bot typing patterns. Compared to humans, bot typing is often unnaturally consistent and only moves in the forward direction.

While device-only approaches can detect traditional bots using fake browsers or suspicious network signatures, they struggle against modern AI agents operating in real browser environments. However, these agents still exhibit detectable behavioral patterns (e.g. perfect click precision and unnaturally consistent reaction times) that behavioral systems can identify.
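
One way to operationalize these signals is to check how variable the inter-keystroke intervals are and whether the input ever moves backward (corrections, deletions), as in Figure 4. The heuristic below is a deliberately simplified illustration of this idea; it is not the detection logic used by Roundtable or any of the other systems evaluated here.

```python
# Toy keystroke-dynamics heuristic: flag typing that is unnaturally
# consistent and never moves backward. Illustrative only -- not the
# actual model used by any of the evaluated systems.
import statistics

def looks_scripted(key_times_ms: list[float], keys: list[str]) -> bool:
    intervals = [b - a for a, b in zip(key_times_ms, key_times_ms[1:])]
    if len(intervals) < 2:
        return False
    mean = statistics.mean(intervals)
    cv = statistics.stdev(intervals) / mean if mean > 0 else 0.0  # coefficient of variation
    has_corrections = any(k == "Backspace" for k in keys)
    # Humans show variable timing and occasional corrections;
    # scripted input tends to be near-constant and forward-only.
    return cv < 0.1 and not has_corrections

# Example: perfectly regular, forward-only typing is flagged as scripted.
times = [0, 100, 200, 300, 400, 500]
keys = list("hello!")
print(looks_scripted(times, keys))  # True
```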

Limitations

Several limitations should be considered when interpreting these results.

First, our benchmarking results depend on complex behavioral, device, and network data that is difficult to package into a reproducible dataset. Unlike traditional machine learning benchmarks with static inputs and outputs, bot detection evaluation requires real-time browser interactions and analyses that cannot easily be captured or replayed for independent verification by other researchers. However, we hope to release code and raw data for the benchmark in the future so others can reproduce our results.

Second, this benchmark did not evaluate the false positive rates of different systems with human users. A complete evaluation would need to measure how often each system incorrectly flags legitimate human users as bots, as this directly impacts user experience and business metrics in real-world deployments.

Finally, our evaluation did not involve active adversarial optimization. Instead, we relied on off-the-shelf AI agents, browser automation tools, and traditional bots in their default configurations. While many of these tools are already designed to evade detection, a dedicated adversary could invest more effort in bypassing specific detection systems through techniques like prompt optimization, behavior randomization, or custom evasion strategies. In future work we will include more aggressive red-teaming approaches to measure the robustness of these detection systems against determined attackers.

The road ahead

Today, AI agents have relatively limited capabilities and are expensive to run at scale. However, as these systems become more sophisticated, the landscape of bot detection will continue to evolve rapidly. The techniques we've evaluated today represent the foundation for an evolving field where detection methods must continuously advance to keep pace with technological progress.

At Roundtable, we're staying on the cutting edge of this evolution and continuously developing new approaches to distinguish human behavior from automated systems. If you'd like to try our Proof of Human system, you can get started for free at roundtable.ai.

hCaptcha's 99.9% Passive Mode presents "challenges" to the user when the system is uncertain or believes the interaction is likely from a bot. This makes the comparison not entirely apples-to-apples with fully invisible systems. We classified every hCaptcha session where a challenge was presented as a "bot," unless an AI agent successfully completed the challenge. Many AI agents refuse to attempt CAPTCHAs, and we did not explicitly encourage CAPTCHA completion via our prompts. The hCaptcha results thus represent an upper-bound estimate of hCaptcha's performance, as most agents could easily solve the challenges when they attempted them.

Mathew Hardy and Mayank Agrawal are co-founders of Roundtable Technologies Inc., where they work on building proof of human authentication systems. Previously, they completed their PhDs in cognitive science at Princeton University.