MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks

Zonglin Wu, Yule Xue, Yaoyao Feng, Xiaolong Wang, Yiren Song

TL;DR

MCA-Bench addresses the need for a unified, large-scale benchmark to evaluate CAPTCHA robustness against VLM-based attacks by integrating four modalities—static visual recognition, point-and-click localization, interactive manipulation, and textual reasoning—across 20 tasks with 180k training samples. The authors fine-tune a shared vision-language backbone (Qwen2.5-VL-7B-Instruct) with LoRA adapters to build task-specific cracking agents, and evaluate against human performance using a single pass-rate metric. Key findings show that modern multimodal backbones achieve high accuracy on simple tasks but struggle with complex interactions and multi-step reasoning, revealing a vulnerability spectrum and guiding practical CAPTCHA hardening. Contributions include the first end-to-end cross-modal CAPTCHA benchmark, a unified evaluation pipeline, and actionable design principles for defense, with open-source datasets and code to enable community replication.
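The single pass-rate metric mentioned above can be illustrated with a minimal sketch. It assumes the pass rate is simply the fraction of attempts that succeed, computed per task; the paper's exact definition may differ. The function name `pass_rates` and the task labels are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {task: fraction of passed attempts}.

    `results` is an iterable of (task_name, passed) pairs. This is an
    assumed definition of pass rate (passed / total), not the authors' code.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for task, ok in results:
        total[task] += 1
        passed[task] += bool(ok)  # True counts as 1, False as 0
    return {task: passed[task] / total[task] for task in total}


# Hypothetical attempt log: task labels are illustrative only.
attempts = [
    ("slider", True),
    ("slider", False),
    ("text_logic", True),
    ("text_logic", True),
]
rates = pass_rates(attempts)
```

Applying the same function to model attempts and to human attempts would give directly comparable numbers per task, which is the kind of comparison the summary describes.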

Abstract

As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. Existing CAPTCHA schemes span a diverse range of modalities, from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions, yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision-language model backbone, we fine-tune specialized cracking agents for each CAPTCHA category, enabling consistent, cross-modal assessments. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings, and crucially offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration. Datasets and code are available online.

Paper Structure

This paper contains 47 sections, 2 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Data samples from MCA-Bench. The benchmark comprises 20 sub-clusters across four categories: Point-and-Click Localization, Static Visual Recognition, Textual Logic Q&A, and Interactive Manipulation.
  • Figure 2: Schematic overview of the MCA-Bench data-acquisition and annotation workflow. From left to right, the four grey panels are Static Visual Recognition, Interactive Manipulation, Point-and-Click Localization, and Textual Logic Q&A; the red labels mark each category’s data-collection pipeline. Each pipeline has four stages: (i) define the raw input format; (ii) apply task-specific geometric transforms, coordinate projections, or prompt/template generation; (iii) separate fine-grained annotation types; and (iv) save the annotations to text files.
  • Figure 3: Schematic of Data Flow Across Four Framework Stages. The diagram illustrates the data flow and key module configuration across the four stages of the end-to-end framework: unified interface access, agent fine-tuning loading, collaborative inference execution, and structured result feedback.
  • Figure 4: Performance Comparison of Multimodal Language Models on MCA-Bench CAPTCHA Tasks. The figure compares the success rates of models including Qwen2.5-VL-7B-Instruct, ChatGPT-4o, Seed1.5-VL, Gemini2.5-Pro, and fine-tuned Qwen2.5-VL-7B-Instruct across MCA-Bench CAPTCHA tasks, covering basic visual recognition, character-based recognition, and advanced multi-step reasoning challenges. Results show fine-tuning consistently improves performance, yet even top-performing models lag behind human-level robustness in complex reasoning tasks.
  • Figure 5: Category distribution of CAPTCHA types in the dataset. The dataset covers a wide range of 3D-interactive, text-based, and visually complex CAPTCHA categories, with each type contributing approximately 5% of the total. The lowest-frequency categories (e.g., commonsense reasoning and text-based arithmetic) represent specialized reasoning-based challenges, while the most common types focus on perceptual and motor interactions.
  • ...and 4 more figures