MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks
Zonglin Wu, Yule Xue, Yaoyao Feng, Xiaolong Wang, Yiren Song
TL;DR
MCA-Bench addresses the need for a unified, large-scale benchmark to evaluate CAPTCHA robustness against VLM-based attacks by integrating four modalities—static visual recognition, point-and-click localization, interactive manipulation, and textual reasoning—across 20 tasks with 180k training samples. The authors fine-tune a shared vision-language backbone (Qwen2.5-VL-7B-Instruct) with LoRA adapters to build task-specific cracking agents, and evaluate against human performance using a single pass-rate metric. Key findings show that modern multimodal backbones achieve high accuracy on simple tasks but struggle with complex interactions and multi-step reasoning, revealing a vulnerability spectrum and guiding practical CAPTCHA hardening. Contributions include the first end-to-end cross-modal CAPTCHA benchmark, a unified evaluation pipeline, and actionable design principles for defense, with open-source datasets and code to enable community replication.
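The single pass-rate metric mentioned above can be illustrated with a minimal sketch. The summary does not define per-task success criteria, so everything here is an assumption made for illustration: the function names `pass_rate` and `click_hit` are hypothetical, and treating a point-and-click attempt as solved when the predicted click falls inside a target bounding box is only one plausible criterion.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def pass_rate(outcomes: Iterable[bool]) -> float:
    """Return the fraction of CAPTCHA attempts that were solved."""
    results = list(outcomes)
    if not results:
        raise ValueError("pass_rate needs at least one attempt")
    return sum(results) / len(results)


def click_hit(click: Point, box: Box) -> bool:
    """Assumed success rule for point-and-click tasks:
    the predicted click lies inside the target bounding box."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


# Two hypothetical attempts against the same target region.
target = (10, 20, 40, 60)
attempts = [click_hit((12, 30), target), click_hit((90, 5), target)]
print(pass_rate(attempts))  # 0.5
```

Reducing every task to a boolean outcome is what would let one number be compared across modalities, and against human solvers, under this reading of the metric.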
Abstract
As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. Existing CAPTCHA schemes span a diverse range of modalities, from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions, yet the community still lacks a unified, large-scale, multimodal benchmark for rigorously evaluating their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision-language model backbone, we fine-tune a specialized cracking agent for each CAPTCHA category, enabling consistent cross-modal assessment. Extensive experiments show that MCA-Bench maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings and, crucially, offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration. Datasets and code are available online.
