OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

TL;DR

OCR-Reasoning addresses the lack of systematic evaluation for text-rich image reasoning by introducing a 1,069-example, fully annotated benchmark spanning 6 reasoning abilities and 18 practical tasks, with reasoning trajectories annotated alongside final answers. It supports zero-shot evaluation of OCR+LLMs, closed-source MLLMs, and open-source MLLMs, and shows that none surpasses 50% accuracy. The evaluation also highlights the critical role of image input, the varying impact of chain-of-thought prompting, and the gap between model families. The findings underscore substantial room for improvement in multimodal text-rich reasoning, and the work provides data collection, annotation, and evaluation pipelines to drive future progress. Limitations include manual annotation costs and reliance on LLMs as judges for reasoning evaluation, both of which point to directions for improving the methodology.

Abstract

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that annotate only the final answers, OCR-Reasoning also annotates the reasoning process. With both the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy above 50% on OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

Paper Structure

This paper contains 16 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a) The percentage of answers in the benchmark's Q&A pairs that can be retrieved from the OCR results. (b) An example of the answers that can be retrieved from the OCR results.
  • Figure 2: Data curation framework of OCR-Reasoning. The framework includes: (1) dataset collection, (2) annotation curation, (3) data correction, and (4) detailed taxonomy.
  • Figure 3: Examples of different categories in OCR-Reasoning. OCR-Reasoning includes six categories: spatial reasoning, numerical analysis reasoning, mathematical reasoning, enumerative reasoning, logical reasoning, and multidisciplinary knowledge reasoning.
  • Figure 4: Key Statistics of OCR-Reasoning.
  • Figure 5: Subject Distribution of OCR-Reasoning.
  • ...and 3 more figures
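
The statistic in Figure 1(a), the share of answers that can be retrieved from OCR results, can be illustrated with a minimal sketch. The matching rule used by the authors is not stated here, so the case-insensitive, whitespace-normalized substring match below is an assumption, and the function names `answer_in_ocr` and `retrievable_rate` are hypothetical.

```python
from typing import Iterable, Tuple


def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def answer_in_ocr(answer: str, ocr_text: str) -> bool:
    """Return True if the normalized answer occurs verbatim in the normalized OCR text."""
    normalized_answer = _normalize(answer)
    # An empty answer would trivially match any text, so treat it as not retrievable.
    return bool(normalized_answer) and normalized_answer in _normalize(ocr_text)


def retrievable_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of (answer, ocr_text) pairs whose answer can be found in the OCR text."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(answer_in_ocr(answer, ocr_text) for answer, ocr_text in pairs)
    return 100.0 * hits / len(pairs)


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from any benchmark.
    toy_pairs = [
        ("42", "Total: 42 USD"),
        ("Paris", "london bridge"),
        ("New  York", "new york city"),
    ]
    print(f"{retrievable_rate(toy_pairs):.1f}% of answers are retrievable from OCR text")
```

Under this reading, a high rate means a benchmark mostly tests text extraction, while a low rate means the answer must be derived by reasoning over the extracted text.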