Shape vs. Context: Examining Human--AI Gaps in Ambiguous Japanese Character Recognition

Daichi Haraguchi

TL;DR

Human and VLM decision boundaries differ in the shape-only task, and shape in context can improve human alignment in some conditions.

Abstract

High text recognition performance does not guarantee that Vision-Language Models (VLMs) share human-like decision patterns when resolving ambiguity. We investigate this behavioral gap by directly comparing humans and VLMs using continuously interpolated Japanese character shapes generated via a $\beta$-VAE. We estimate decision boundaries in a single-character recognition task (the shape-only task) and evaluate whether VLM responses align with human judgments under shape in context (i.e., when an ambiguous character near the human decision boundary is embedded in word-level context). We find that human and VLM decision boundaries differ in the shape-only task, and that shape in context can improve human alignment in some conditions. These results highlight qualitative behavioral differences, offering foundational insights toward human--VLM alignment benchmarking.
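To make the notion of a decision boundary concrete, the following is a minimal sketch of one way such a boundary could be read off a response curve: find the interpolation parameter at which the response rate crosses 50%, using linear interpolation between grid points. The abstract does not specify the estimation procedure, so the function name `estimate_boundary`, the interpolation grid, and the response rates below are all illustrative assumptions rather than the paper's method or data.

```python
from typing import Optional, Sequence


def estimate_boundary(
    alphas: Sequence[float],
    rates: Sequence[float],
    threshold: float = 0.5,
) -> Optional[float]:
    """Return the interpolation parameter at which `rates` first crosses `threshold`.

    alphas: interpolation parameters in increasing order (hypothetical grid).
    rates:  fraction of responses naming the second character at each alpha.
    Linear interpolation is used between the two neighbouring grid points.
    Returns None when the curve never reaches the threshold.
    """
    if len(alphas) != len(rates):
        raise ValueError("alphas and rates must have the same length")
    for i in range(len(alphas) - 1):
        a0, a1 = alphas[i], alphas[i + 1]
        r0, r1 = rates[i], rates[i + 1]
        if r0 == threshold:
            return a0
        # A sign change means the threshold lies strictly between the two points.
        if (r0 - threshold) * (r1 - threshold) < 0:
            return a0 + (threshold - r0) * (a1 - a0) / (r1 - r0)
    if rates and rates[-1] == threshold:
        return alphas[-1]
    return None


# Made-up response rates for illustration only; these are not the paper's data.
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
human_rates = [0.0, 0.1, 0.4, 0.9, 1.0]
vlm_rates = [0.0, 0.3, 0.7, 1.0, 1.0]

human_boundary = estimate_boundary(alphas, human_rates)
vlm_boundary = estimate_boundary(alphas, vlm_rates)
print(f"human boundary ~ {human_boundary:.3f}, VLM boundary ~ {vlm_boundary:.3f}")
```

Under this sketch, a gap between the two estimated boundaries is one simple way to quantify how differently humans and a VLM resolve the same ambiguous shape; a fitted psychometric function would be a natural alternative.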

Paper Structure (18 sections, 3 figures)

Introduction
Stimuli Generation via $\beta$-VAE
Overview
Training Procedure and Stimuli Synthesis
Construction of Contextual Word Images
Experimental Procedure for Humans and VLMs
User Study
Shape-only task (RQ1)
Shape-in-context task (RQ2)
Ethics
VLM Experiments
Analysis
Shape-only Character Recognition (RQ1)
Shape-in-Context Word Recognition (RQ2)
Sole-Occurrence Context
...and 3 more sections

Figures (3)

Figure 1: Construction of contextual word images. Examples of sole-occurrence and co-occurrence contexts, where a single character is replaced with an ambiguous glyph (highlighted in red box) selected to yield approximately 50% recognition by human participants in the shape-only task (RQ1). In the co-occurrence context, additional so or n characters appear elsewhere in the word.
Figure 2: Single character recognition by each interpolation parameter (RQ1), aggregated across 10 fonts. Representative interpolated characters from one example font are shown along the x-axis for visual reference.
Figure 3: Shape-in-context word recognition (RQ2). The target character was replaced with an ambiguous glyph X (an interpolation between so and n) in either so-biased or n-biased word contexts, under (a) sole-occurrence and (b) co-occurrence conditions. Asterisks denote significant differences from humans after Bonferroni correction (*$p_{\mathrm{adj}}<.05$, ***$p_{\mathrm{adj}}<.001$).
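The Bonferroni adjustment mentioned in the Figure 3 caption can be sketched as follows: each raw p-value is multiplied by the number of comparisons and capped at 1, and the adjusted value is then mapped to the asterisk levels listed in the caption. The number of comparisons and the raw p-values here are hypothetical, and the helper names `bonferroni_adjust` and `significance_stars` are introduced only for illustration.

```python
from typing import List, Sequence


def bonferroni_adjust(p_values: Sequence[float]) -> List[float]:
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significance_stars(p_adj: float) -> str:
    """Map an adjusted p-value to the two asterisk levels named in the caption."""
    if p_adj < 0.001:
        return "***"
    if p_adj < 0.05:
        return "*"
    return ""


# Hypothetical raw p-values for three model-vs-human comparisons (not the paper's results).
raw_p = [0.0001, 0.02, 0.5]
for p, p_adj in zip(raw_p, bonferroni_adjust(raw_p)):
    print(f"p = {p:.4f}  p_adj = {p_adj:.4f}  {significance_stars(p_adj)}")
```

Note that a raw p-value of 0.02 no longer counts as significant once it is adjusted for three comparisons, which is the conservative behavior the correction is meant to provide.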
