MMR: Evaluating Reading Ability of Large Multimodal Models

Jian Chen, Ruiyi Zhang, Yufan Zhou, Ryan Rossi, Jiuxiang Gu, Changyou Chen

TL;DR

This work proposes the Multi-Modal Reading (MMR) benchmark, the first text-rich image benchmark built on human annotations with the help of language models. Evaluations on it reveal the limited capabilities of existing LMMs, underscoring the value of the benchmark.

Abstract

Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks consist of simple extraction-based question answering, and many LMMs now easily achieve high scores on them. As a result, current benchmarks fail to accurately reflect the performance of different models, which motivates a new benchmark that evaluates their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. Our evaluation of several state-of-the-art LMMs, including GPT-4o, reveals the limited capabilities of existing LMMs, underscoring the value of our benchmark.

Paper Structure (36 sections, 1 equation, 8 figures, 1 table)

Figures (8)

  • Figure 1: Text and object length distribution in MMR.
  • Figure 2: Word cloud of text (left) and object tags (right) in the MMR benchmark.
  • Figure 3: Examples of human-annotated dense captions. All text elements are annotated in detail, including color, position, and content. Detailed descriptions of visual elements and layout information are provided as well.
  • Figure 4: Example questions from MMR to evaluate reading capabilities.
  • Figure : (a)
  • ...and 3 more figures