We evaluate five leading invisible CAPTCHAs and find significant differences in their ability to detect modern bots and AI agents.
CAPTCHAs and other bot detection methods were originally designed to stop automated scripts that operate very differently from human users. It's unclear how well these detection systems perform against modern AI agents that operate real browsers and interact with web applications in more human-like ways.
To investigate, we designed a comprehensive benchmark to test bot detection systems against a diverse range of automated agents. We evaluate five leading detection systems—Google reCAPTCHA v3, hCaptcha, FingerprintJS Pro, Cloudflare Turnstile, and Roundtable Proof of Human—and find significant differences in performance. Detection rates range from 33% to 87%, with systems that rely on both behavior and device signals significantly outperforming device-only approaches.
Here, we introduce a new benchmark for evaluating bot detection systems based on their ability to identify different types of automated agents. This involves (1) creating a set of real-world web interaction tasks, (2) programming bots to complete these tasks, and (3) measuring the performance of different systems in identifying these bots.
This benchmark focuses specifically on invisible bot detection methods that don't increase friction for humans through challenges or user interaction requirements. Modern CAPTCHAs and other challenge-based systems can often filter bots but create significant user experience friction. In theory, one could achieve near-perfect bot detection accuracy by adding enough friction (multiple complex CAPTCHAs, multi-step verification, etc.). However, this approach is impractical for real-world applications, as it leads to high false positive rates and user frustration.
We constructed a new benchmark to evaluate the performance of bot detection systems. The benchmark is based on five distinct web tasks:
| Task | Description | Relevant Behaviors |
| --- | --- | --- |
| Sign-up form | Registration form with name, email, and password fields | Typing patterns, field navigation, form completion speed |
| Online survey | Multi-question survey with various input types | Response timing, selection patterns, scrolling behavior |
| Review form | Product review with star rating and text feedback | Click precision, text entry patterns, rating interaction |
| Article unlock | Email capture form to access premium content | Mouse movements, typing rhythm, submission timing |
| Psychological experiment | Cognitive task requiring attention and decision-making | Reaction times, attention patterns, cognitive load indicators |
We used three different types of bots in our evaluation, in equal proportion: traditional bots, browser automation tools, and AI agents. Traditional bots were simple scripts that hard-coded the task logic ahead of time (i.e., clicking specific buttons or filling out forms with predetermined values at fixed time intervals). Browser automation tools take control of a human's real browser environment and can be intelligent (using AI to understand the task) or non-intelligent (using hard-coded behaviors). AI agents are advanced bots that use LLMs to understand and complete tasks with minimal human oversight.
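To make the distinction concrete, here is a minimal sketch of what a traditional bot looks like in practice: hard-coded selectors, predetermined values, and fixed delays. This is an illustration only (it uses Playwright, and the URL and selectors are placeholders), not the actual benchmark code.

```python
# Illustrative "traditional" bot: hard-coded task logic, no model in the loop.
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/signup")   # placeholder task URL
    page.fill("#name", "Jane Doe")            # predetermined values...
    page.wait_for_timeout(500)                # ...at fixed time intervals
    page.fill("#email", "jane@example.com")
    page.wait_for_timeout(500)
    page.fill("#password", "hunter2!")
    page.click("button[type=submit]")         # hard-coded task logic
    browser.close()
```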
We compared five different bot detection systems: Roundtable Proof of Human, Google reCAPTCHA v3, hCaptcha 99.9% Passive Mode, FingerprintJS Pro, and Cloudflare Turnstile. For each system, we defined a criterion for classifying a session as bot or human based on its output (see the note on hCaptcha's Passive Mode below).
We evaluated each bot detection system 10 times on each task/bot combination and measured performance in terms of percentage of bots identified. This gave us a final sample size of 150 sessions (5 tasks x 3 bot types x 10 trials) for each detection system (n=750 total).
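For a concrete picture of the evaluation loop, here is a minimal sketch. The `run_session` stub stands in for the real harness that drives a bot through a task and reads the detection system's verdict; the value it returns below is a placeholder for illustration only.

```python
import random
from itertools import product

TASKS = ["signup", "survey", "review", "article_unlock", "experiment"]
BOT_TYPES = ["traditional", "browser_automation", "ai_agent"]
TRIALS = 10

def run_session(task: str, bot_type: str) -> bool:
    """Stand-in for the real harness: run the bot on the task page and
    return True if the detection system flagged the session as a bot."""
    return random.random() < 0.8  # placeholder value for illustration

results = [run_session(task, bot_type)
           for task, bot_type in product(TASKS, BOT_TYPES)
           for _ in range(TRIALS)]  # 5 tasks x 3 bot types x 10 trials = 150 sessions

detection_rate = sum(results) / len(results)
print(f"Overall detection rate: {detection_rate:.1%}")
```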
Bot detection performance was determined as the overall proportion of bots identified across all tasks and bot types. The final results are shown in Figure 3.
Roundtable Proof of Human achieved the highest bot detection score, successfully identifying over 86% of bots across all tasks and bot types. Google reCAPTCHA v3 and hCaptcha scored next best, with FingerprintJS Pro and Cloudflare Turnstile performing worst.
The performance gap between behavioral and device-only systems reflects fundamental differences in detection approaches. Google reCAPTCHA v3, hCaptcha, and Roundtable Proof of Human analyze both device and user behavior patterns. By contrast, FingerprintJS Pro and Cloudflare Turnstile rely primarily on device fingerprinting and network characteristics.
While device-only approaches can detect traditional bots that use fake browsers or suspicious network signatures, they struggle against modern AI agents operating in real browser environments. These agents, however, still exhibit detectable behavioral patterns (e.g., perfect click precision and unnaturally consistent reaction times) that behavioral systems can identify.
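As a toy illustration of the kind of behavioral signal involved (not any vendor's actual model), one could flag sessions whose clicks land implausibly close to element centers or whose input timing is unnaturally uniform:

```python
import statistics

def behavioral_flags(click_offsets_px, inter_event_ms):
    """Toy behavioral heuristics over recorded interaction data.

    click_offsets_px: distance of each click from the target element's center
    inter_event_ms:   gaps between successive input events (keystrokes, clicks)
    """
    flags = {}
    # Humans rarely click dead-center every time; near-zero spread is suspicious.
    flags["perfect_click_precision"] = max(click_offsets_px) < 1.0
    # Humans show variable timing; a very low coefficient of variation is suspicious.
    cv = statistics.stdev(inter_event_ms) / statistics.mean(inter_event_ms)
    flags["uniform_timing"] = cv < 0.05
    return flags

print(behavioral_flags([0.2, 0.4, 0.1], [120, 121, 119, 120]))
# {'perfect_click_precision': True, 'uniform_timing': True}
```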
Several limitations should be considered when interpreting these results.
First, our benchmarking results depend on complex behavioral, device, and network data that is difficult to package into a reproducible dataset. Unlike traditional machine learning benchmarks with static inputs and outputs, bot detection evaluation requires real-time browser interactions and analyses that cannot be easily captured or replayed for independent verification by other researchers. However, we hope to release code and raw data for the benchmark in the future so others can reproduce our results.
Second, this benchmark did not evaluate the false positive rates of different systems with human users. A complete evaluation would need to measure how often each system incorrectly flags legitimate human users as bots, as this directly impacts user experience and business metrics in real-world deployments.
Finally, our evaluation did not involve active adversarial optimization. Instead, we relied on off-the-shelf AI agents, browser automation tools, and traditional bots in their default configurations. While many of these tools are already designed to evade detection, a dedicated adversary could invest more effort in bypassing specific detection systems through techniques like prompt optimization, behavior randomization, or custom evasion strategies. In future work we will include more aggressive red-teaming approaches to measure the robustness of these detection systems against determined attackers.
Today, AI agents have relatively limited capabilities and are expensive to run at scale. However, as these systems become more sophisticated, the landscape of bot detection will continue to evolve rapidly. The techniques we've evaluated today represent the foundation for an evolving field where detection methods must continuously advance to keep pace with technological progress.
At Roundtable, we're staying on the cutting edge of this evolution and continuously developing new approaches to distinguish human behavior from automated systems. If you'd like to try our Proof of Human system, you can get started for free at roundtable.ai.
hCaptcha's 99.9% Passive Mode presents "challenges" to the user when the system is uncertain or believes the interaction is likely from a bot. This makes the comparison not entirely apples-to-apples with fully invisible systems. We classified an hCaptcha session as a "bot" whenever a challenge was presented, unless an AI agent successfully completed the challenge. Many AI agents refuse to attempt CAPTCHAs, and we did not explicitly encourage CAPTCHA completion via our prompts. The hCaptcha results thus represent an upper-bound estimate of hCaptcha's performance, as most agents could easily solve the challenges when they attempted to.
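Expressed as a simple rule (a restatement of the criterion above, not our actual evaluation code), the hCaptcha labeling looks like this:

```python
def classify_hcaptcha_session(challenge_presented: bool, challenge_solved: bool) -> str:
    """A session counts as a detected bot whenever a challenge was presented,
    unless the AI agent went on to solve it."""
    if challenge_presented and not challenge_solved:
        return "bot"
    return "human"
```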
Mathew Hardy and Mayank Agrawal are co-founders of Roundtable Technologies Inc., where they work on building proof of human authentication systems. Previously, they completed their PhDs in cognitive science at Princeton University.