Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models

Heng Zhou, Ao Yu, Li Kang, Yuchen Fan, Yutao Fan, Xiufeng Song, Hejia Geng, Yiran Qin

TL;DR

This work investigates the typography gap in vision-language models by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Model scale fails to predict performance, and accuracy is uniform across difficulty levels; together these findings point to a training-data omission rather than a capacity ceiling.

Abstract

Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.

Paper Structure (77 sections, 10 figures, 14 tables, 1 algorithm)

Figures (10)

  • Figure 1: Reading $\neq$ Seeing. Given a single rendered image, four state-of-the-art VLMs unanimously read the text content correctly yet systematically misidentify its font family, size, and style. Only color, a pixel-level cue, is recognized correctly by all models. The right panel quantifies this gap: color accuracy reaches 95--100%, while font style peaks at only 23--34%, barely above the 25% random baseline.
  • Figure 2: Evaluation pipeline. (1) Text Corpus: script-appropriate sentences spanning Latin, CJK, Arabic, and Devanagari are paired with 26 fonts. (2) Image Rendering: each sentence is rendered at 96 DPI with anti-aliasing, sampling from 26 fonts $\times$ 8 sizes $\times$ 4 styles $\times$ 8 colors, with background chosen to guarantee a 4.5:1 contrast ratio. (3) MCQ Generation: each rendered image yields four multiple-choice questions, one per property, with hard within-category distractors, producing 1,000 questions at three difficulty levels. (4) VLM Evaluation: models are queried at temperature 0; responses are parsed with a 4-step cascade.
  • Figure 3: Dataset statistics. (A) Key dimensions. (B) Sample distribution across scripts and difficulty levels. (C) Font category breakdown.
  • Figure 4: (a) Resolution ablation: GPT-5.2 and Gemini-3-Flash peak at 1$\times$ with non-monotonic degradation in both directions; Qwen3-VL-8B remains stable. (b) Robustness to image degradation: Gemini-3-Flash suffers the largest absolute drops under blur ($-$17.3pp) and JPEG compression ($-$10.2pp), while Qwen3-VL-8B is the most stable. Dashed lines indicate 1$\times$ baselines.
  • Figure 5: Attention heatmaps for hard-split failures. Self-attention weights from the final decoder layer of Qwen2.5-VL-7B, averaged over all heads. Font Family: attention spreads uniformly; discriminative glyph features are not attended to. Font Size: attention covers only part of the text, failing to integrate spatial extent. Font Style: attention is character-level but assigns equal weight to all strokes, missing the relative thickness that distinguishes bold from regular.
  • ...and 5 more figures
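
The rendering step in the Figure 2 caption requires a background that guarantees a 4.5:1 contrast ratio against the text color. The captions do not say how that background is chosen, so the following is only a minimal sketch under stated assumptions: it uses the WCAG 2.x definitions of relative luminance and contrast ratio, and the helper names (`contrast_ratio`, `pick_background`) and the white/black candidate backgrounds are hypothetical, not taken from the paper. It also computes the size of the attribute space the caption says images are sampled from.

```python
# Sketch only: WCAG 2.x contrast check for a text/background pair.
# Helper names and candidate backgrounds are illustrative assumptions.

def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with 8-bit channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """Contrast ratio between two colors, from 1.0 (identical) up to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

def pick_background(fg, candidates=((255, 255, 255), (0, 0, 0)), min_ratio=4.5):
    """Return the first candidate background that meets the contrast threshold."""
    for bg in candidates:
        if contrast_ratio(fg, bg) >= min_ratio:
            return bg
    raise ValueError("no candidate background meets the contrast threshold")

# Size of the attribute space named in the caption:
# 26 fonts x 8 sizes x 4 styles x 8 colors.
n_configs = 26 * 8 * 4 * 8

if __name__ == "__main__":
    red = (255, 0, 0)
    print(pick_background(red), round(contrast_ratio(red, (0, 0, 0)), 2))
    print(n_configs)
```

With white and black as candidates, every text color passes against at least one of them: white works for luminance up to about 0.183 and black for luminance from 0.175 upward, so the two ranges overlap. A real pipeline may draw from a richer background palette.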