
Perceptual Gaps: ASCII Art and Overlapping Audio as CAPTCHA

Choon-Hou Rafael Chong

Abstract

As multimodal large language models (LLMs) advance, traditional CAPTCHAs are becoming ineffective at distinguishing humans from bots. To address this shift, this paper investigates CAPTCHAs built on tasks for which humans have evolved highly specialised neural processing. We introduce two CAPTCHA classes: a vision-based CAPTCHA, which renders alphanumeric strings as ASCII art, and an audio-based CAPTCHA, which poses a question-answering task over overlapping or noise-corrupted audio context. We evaluate the vision-based CAPTCHA as both text and image input across multiple frontier LLMs (GPT 5.2, Gemini 3, etc.), and assess the audio-based CAPTCHA under augmentations such as background noise, Gaussian noise, and overlapping speech. None of the LLMs solved a single ASCII-based CAPTCHA; the best-performing model inferred at most one or two characters. Additionally, all models that supported audio performed only modestly better than chance on the audio CAPTCHAs. Our results suggest that these CAPTCHAs are exceptionally effective today, but it is unclear whether they can withstand the fast-evolving landscape of artificial intelligence. Subsequent research is needed to determine whether these tasks are temporary vulnerabilities or represent a more durable method of distinguishing humans from bots.
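To make the vision-based scheme concrete, the sketch below renders a challenge string as multi-line ASCII art using a toy hand-rolled block font. The font, glyph set, and dimensions here are illustrative assumptions; the paper's actual renderer and character set are not specified in this excerpt.

```python
# Minimal 5-row block font covering a few example glyphs.
# This is a hypothetical sketch, not the paper's renderer.
FONT = {
    "A": [" ## ", "#  #", "####", "#  #", "#  #"],
    "B": ["### ", "#  #", "### ", "#  #", "### "],
    "3": ["### ", "   #", " ## ", "   #", "### "],
    "7": ["####", "   #", "  # ", " #  ", " #  "],
}

def render_ascii_captcha(text: str) -> str:
    """Render a challenge string as ASCII art, row by row.

    Each glyph is 4 columns wide; glyphs are separated by 2 spaces.
    """
    rows = []
    for r in range(5):  # the toy font has 5 rows per glyph
        rows.append("  ".join(FONT[ch][r] for ch in text))
    return "\n".join(rows)

# Example challenge using the glyphs defined above.
print(render_ascii_captcha("A37B"))
```

A human reads the rendered string at a glance, while an LLM receiving the same characters as text (or as a screenshot) must recover letterforms from spatial structure it was not trained to parse.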

Paper Structure

This paper contains 37 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: reCAPTCHA v2
  • Figure 2: Comparison of CAPTCHA solving by LLMs
  • Figure 3: Example of ChatGPT 5.2 and Gemini 3 Pro (High Thinking) failing to solve a very simple ASCII CAPTCHA.