SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao

TL;DR

This work proposes SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By invalidating text-based shortcuts, it compels the model to activate and optimize its visual text extraction pathways.

Abstract

Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.
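
The VQ transformation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes Pillow is installed, and the function name `to_visualized_question`, the generic prompt string, the palette pool, the banner layout, and the font metrics are all hypothetical choices made for illustration. It pastes the original image next to a banner containing the question, with randomly sampled colors, padding, and banner position, and returns a text prompt that no longer contains the question.

```python
import random
import textwrap

from PIL import Image, ImageDraw, ImageFont

# Hypothetical generic instruction: the question itself lives only in the image.
GENERIC_PROMPT = "Answer the question shown in the image."

# Rough metrics assumed for Pillow's built-in default font (an approximation;
# a real pipeline would measure text with a TrueType font instead).
CHAR_W, LINE_H = 7, 14

# Assumed style pool: (text color, banner color) pairs.
PALETTES = [
    ((0, 0, 0), (255, 255, 255)),
    ((255, 255, 255), (0, 0, 0)),
    ((20, 20, 120), (255, 250, 205)),
]


def to_visualized_question(image, question, rng):
    """Return (vq_image, text_prompt) with the question rendered into the image."""
    # Sample a random style for this training sample.
    fg, bg = rng.choice(PALETTES)
    pad = rng.randint(4, 12)
    on_top = rng.random() < 0.5

    # Wrap the question so it roughly fits the image width.
    chars_per_line = max(10, (image.width - 2 * pad) // CHAR_W)
    lines = textwrap.wrap(question, width=chars_per_line) or [""]
    banner_h = LINE_H * len(lines) + 2 * pad

    # Canvas = original image plus a banner above or below it.
    canvas = Image.new("RGB", (image.width, image.height + banner_h), bg)
    banner_y = 0 if on_top else image.height
    image_y = banner_h if on_top else 0
    canvas.paste(image.convert("RGB"), (0, image_y))

    # Draw the question text line by line inside the banner.
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for i, line in enumerate(lines):
        draw.text((pad, banner_y + pad + i * LINE_H), line, fill=fg, font=font)

    return canvas, GENERIC_PROMPT
```

Calling `to_visualized_question(img, "What is the value of the tallest bar?", random.Random(0))` on any Pillow image returns a taller image that carries the question, together with a prompt that gives the model no textual copy of it. The styles actually randomized by SimpleOCR may differ from the three knobs sampled here.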

Paper Structure

This paper contains 44 sections, 3 equations, 5 figures, 8 tables, 2 algorithms.

Figures (5)

  • Figure 1: (a) Visualized-Question (VQ) Format. We render the question text into the image as the only question source, removing text-channel shortcuts and requiring visual reading. (b) Capability-Utilization Gap. On Qwen2.5-VL-7B, performance drops markedly under VQ versus standard inputs, indicating that OCR capability is not reliably utilized during reasoning.
  • Figure 2: The SimpleOCR framework. During training, all inputs are transformed into visual question contexts $C_{vq}$, where question text is rendered onto images. This structurally eliminates text-based shortcuts and forces visual OCR engagement. At inference, models trained this way demonstrate robust performance on standard inputs $C_{orig}$. The method integrates seamlessly as an augmentation branch in existing RL frameworks.
  • Figure 3: Performance on OCR-intensive benchmarks. SimpleOCR demonstrates superior performance, achieving 81.6% on ChartQA and 69.1% on HallusionBench.
  • Figure 4: The "U-Shaped" Optimization Conflict. We report the average performance across four representative OOD benchmarks. The mixed strategy (50% VQ) results in a net performance loss, illustrating that contradictory modality signals hinder generalization.
  • Figure 5: Left: On MathVista, the GRPO baseline is misled by hallucinated semantic priors, while SimpleOCR correctly identifies material properties. Right: On ChartQA, the baseline relies on superficial keyword spotting, whereas SimpleOCR performs holistic visual analysis. Blue: correct grounding; red: heuristic errors.
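
The training and inference contexts in Figure 2, and the mixing ratio studied in Figure 4, can be sketched in a few lines. This is a hedged illustration rather than the paper's code: `Context`, `build_context`, `vq_ratio`, and the generic prompt string are hypothetical names, and the image is represented only by an identifier. A ratio of 1.0 corresponds to transforming all training inputs into $C_{vq}$, 0.0 to the standard input $C_{orig}$, and 0.5 to the mixed strategy reported in Figure 4.

```python
import random
from dataclasses import dataclass

# Hypothetical generic instruction used when the question is only in the image.
GENERIC_PROMPT = "Answer the question shown in the image."


@dataclass(frozen=True)
class Context:
    image_id: str            # reference to the (possibly re-rendered) image
    question_in_image: bool  # True for C_vq, False for C_orig
    text_prompt: str         # what the model receives through the text channel


def build_context(image_id, question, vq_ratio, rng):
    """Pick C_vq with probability vq_ratio, otherwise keep C_orig."""
    if rng.random() < vq_ratio:
        # C_vq: the question is rendered onto the image, so the text channel
        # carries only a generic instruction.
        return Context(image_id, True, GENERIC_PROMPT)
    # C_orig: the question stays in the text prompt.
    return Context(image_id, False, question)


def build_training_batch(samples, vq_ratio, seed=0):
    """samples: iterable of (image_id, question) pairs."""
    rng = random.Random(seed)
    return [build_context(i, q, vq_ratio, rng) for i, q in samples]
```

At inference, the standard input corresponds to calling `build_context` with `vq_ratio=0.0`, which leaves the question in the text prompt.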