Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation

Arina Kharlamova, Bowei He, Chen Ma, Xue Liu

TL;DR

Spatial CAPTCHA introduces a generative framework to differentiate humans from machines by exploiting spatial reasoning invariants. It builds a parameterized instance generation pipeline and a Spatial-CAPTCHA-Bench covering four spatial abilities and three difficulty levels, yielding $N_{inst}=1050$ total items. On Spatial-CAPTCHA-Bench, humans achieve near-perfect accuracy while state-of-the-art multimodal models top out at about $31.0\%$ Pass@1, revealing systematic invariant violations and calibration gaps in AI models. The framework serves as both a security mechanism and a diagnostic tool for spatial cognition, with open data/code and planned extensions to GUI-interactive and temporal-spatial tasks to improve robustness and real-world applicability.

Abstract

Online services rely on CAPTCHAs as a first line of defense against automated abuse, yet recent advances in multi-modal large language models (MLLMs) have eroded the effectiveness of conventional designs that focus on text recognition or 2D image understanding. To address this challenge, we present Spatial CAPTCHA, a novel human-verification framework that leverages fundamental differences in spatial reasoning between humans and MLLMs. Unlike existing CAPTCHAs, which rely on low-level perception tasks that are vulnerable to modern AI, Spatial CAPTCHA generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation. These skills are intuitive for humans but difficult for state-of-the-art (SOTA) AI systems. The system employs a procedural generation pipeline with constraint-based difficulty control, automated correctness verification, and human-in-the-loop validation to ensure scalability, robustness, and adaptability. Evaluation on a corresponding benchmark, Spatial-CAPTCHA-Bench, demonstrates that humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0% Pass@1 accuracy. Furthermore, a comparison with Google reCAPTCHA confirms the effectiveness of Spatial CAPTCHA as both a security mechanism and a diagnostic tool for spatial reasoning in AI.
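The Pass@1 figure quoted above is read here in its usual sense: the fraction of benchmark items a model answers correctly on its first (single) attempt. The following is a minimal sketch of that calculation under this assumption; the function name `pass_at_1` and the sample outcomes are hypothetical illustrations, not the paper's evaluation code or data.

```python
from typing import Sequence


def pass_at_1(first_attempt_correct: Sequence[bool]) -> float:
    """Return the fraction of items solved on the first attempt.

    Assumption: one attempt is recorded per item, so Pass@1 reduces
    to plain accuracy over the item set.
    """
    if not first_attempt_correct:
        raise ValueError("need at least one item")
    return sum(first_attempt_correct) / len(first_attempt_correct)


# Hypothetical outcomes for ten items (NOT results from the paper).
outcomes = [True, False, False, True, False,
            False, False, True, False, False]
score = pass_at_1(outcomes)
print(f"Pass@1 = {score:.1%}")
```

With one attempt per item the metric is simply accuracy, which is why the abstract can report it as "Pass@1 accuracy".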

Paper Structure

This paper contains 33 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: End-to-end synthesis pipeline. Input variables are sampled from the manifest (formalized in §\ref{sec:manifest}).
  • Figure 2: Distributions of response times and accuracies across task types, difficulty levels, and models.
  • Figure 3: Overview of task difficulty, model profiles, and reliability. Colours in (b,c) mark the top-5 models from Table \ref{table:eval}; the results shown are also consistent with Figure \ref{fig:time-confidence}.
  • Figure 4: Illustrative examples of tasks targeting Spatial perception and reference system ability.
  • Figure 5: Examples of tasks probing Spatial orientation and perspective-taking. Participants must mentally adopt alternative viewpoints to determine relative positions or directions of objects. The design highlights the distinction between object-centered transformations (rotation) and observer-centered transformations (orientation shift).
  • ...and 3 more figures