CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training

Yuxi Chen; Haoyu Zhai; Chenkai Wang; Rui Yang; Lingming Zhang; Gang Wang; Huan Zhang

Abstract

GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. Specialized CAPTCHA-solving pipelines exist, but they cannot handle general GUI tasks. To address this gap, we introduce ReCAP, a CAPTCHA-capable native GUI agent that robustly solves modern, interactive CAPTCHA challenges while preserving its performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated data collection and curation pipeline that generates large-scale CAPTCHA interaction trajectories paired with reasoning traces. Because CAPTCHA solving often requires multi-step interaction and recovery from intermediate mistakes, we further leverage failed trajectories to construct self-correction data, training agents to reflect on errors and correct their actions online. Across held-out test sets, ReCAP improves CAPTCHA-solving success from roughly 30% to 80%, while maintaining strong performance on general GUI-agent benchmarks.

Paper Structure (46 sections, 1 equation, 3 figures, 5 tables)

Introduction
Related Works
GUI Agent
Evolution of CAPTCHAs
Automated CAPTCHA Solving
Methodology
Dynamic CAPTCHA System
Stochastic Rendering
Unbounded Generation via Visual Diversity
Scalable Data Collection and Curation
Solution Trace Generation with Reasoning Data
Self-correction Trace Generation
Multi-Action Outputs
Training Paradigm
Unified Loss Function
...and 31 more sections

Figures (3)

Figure 1: Performance on CAPTCHA and general GUI agent benchmarks. Left: CAPTCHA solving performance across seven challenge types. Right: Performance on general GUI agent benchmarks. ReCAP-32B consistently outperforms baseline GUI agents and prior frameworks on all seven CAPTCHA challenges while maintaining strong general GUI capabilities.
Figure 2: The suite of CAPTCHA challenges in our dynamic CAPTCHA system, designed to train fundamental CAPTCHA-solving primitives. The challenges are grouped into four core interaction primitives: Optical Character Recognition (OCR), Continuous Control (Dragging), Spatial Localization (Clicking), and Visual Semantic Comprehension.
Figure 3: Data collection and curation pipeline for our CAPTCHA training dataset. a) shows the reasoning solution trace generation pipeline; b) shows the self-correction trace generation pipeline.
