
Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense

Jiacheng Liu, Yaxin Luo, Jiacheng Cui, Xinyi Shang, Xiaohan Zhao, Zhiqiang Shen

TL;DR

Next-Gen CAPTCHAs address the vulnerability of existing CAPTCHA systems to GUI-enabled agents by exploiting the Cognitive Gap between humans and agents in interactive perception, memory, decision-making, and action. The authors design a procedurally generated, rule-verified suite of 27 CAPTCHA families and an extended POMDP framework to model GUI-agent interaction, plus a scalable data-curation pipeline and a real-web evaluation platform. In live-browser experiments, humans perform near ceiling (~98.8% Pass@1) while high-reasoning GUI agents remain largely unsuccessful, yielding a substantial defender margin and an economic asymmetry (in cost and latency) that disfavors attackers. The work provides a practical, scalable defense for the agentic web era and motivates accessibility-aware deployment and further study of interactive perception-grounding vulnerabilities.

Abstract

The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent reasoning-heavy models such as Gemini3-Pro-High and GPT-5.2-Xhigh have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like "Bingo". In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline that allows large-scale, easily extended evaluations. Notably, for backend-supported types, our system can generate an effectively unbounded number of CAPTCHA instances. We exploit the persistent human-agent "Cognitive Gap" in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.

Paper Structure (21 sections, 4 equations, 12 figures, 4 tables)


Figures (12)

  • Figure 1: Pass@1 of frontier models used as GUI-agent backbones on our Next-Gen CAPTCHA benchmark.
  • Figure 2: With enhanced Computer Use abilities such as taking screenshots, clicking, and dragging, along with advanced thinking and tool use (e.g., searching), Claude-Cowork-Opus4.5 can now solve the "Bingo" CAPTCHA effectively and efficiently.
  • Figure 3: Current CAPTCHA systems fail. Recent advanced MLLMs can break current CAPTCHA systems even without extra-high thinking effort (we use the default thinking settings of their official APIs).
  • Figure 4: Success-trajectory correlation differs between current and Next-Gen CAPTCHAs. Left & Middle: Spearman $\rho$ between each CAPTCHA family's Pass@1 and logged trajectory metrics (bar plot on the left; heatmap in the middle; * denotes significance). Current CAPTCHAs show non-trivial correlations, while Next-Gen correlations are near zero. Right: a representative Next-Gen failure on a jigsaw-like puzzle, where the agent scrolls and evaluates but skips drag-and-drop and prematurely clicks Submit.
  • Figure 5: Next-Gen CAPTCHA family examples. Representative instances from Next-Gen CAPTCHA task families. The full family list is in Table \ref{tab:captcha_families}; additional examples for all families are in Appendix \ref{appendix:captcha_gallery}.
  • ...and 7 more figures
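
The Spearman $\rho$ named in the Figure 4 caption is a rank correlation: both variables are replaced by their ranks, and the Pearson correlation of those ranks is taken. The sketch below shows that computation in plain Python. It is only an illustration, not the authors' analysis code: the function names, the per-family Pass@1 values, and the `num_actions` trajectory metric are all hypothetical, and the significance test marked by * in the figure is not reproduced.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values, one entry per CAPTCHA family (not from the paper).
pass_at_1 = [0.90, 0.75, 0.60, 0.40, 0.10]
num_actions = [3, 5, 6, 9, 14]  # assumed trajectory metric: actions per episode

rho = spearman_rho(pass_at_1, num_actions)
print(f"Spearman rho = {rho:.2f}")
```

In this toy data the metric rises strictly as Pass@1 falls, so the ranks are perfectly reversed. A value near zero, as the caption reports for Next-Gen CAPTCHAs, would mean the ordering of families by the metric says little about their ordering by success.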