Exploring the Capabilities of Large Multimodal Models on Dense Text

Shuo Zhang; Biao Yang; Zhang Li; Zhiyin Ma; Yuliang Liu; Xiang Bai

TL;DR

This work addresses the challenge of evaluating dense-text understanding in visual models by introducing DT-VQA, a dense-text Visual Question Answering dataset with 170k QA pairs across 30k images drawn from four text-rich sources. The authors benchmark a range of large multimodal models, including GPT-4V, Gemini, and open-source LMMs, and demonstrate that dense-text questions remain difficult even for state-of-the-art systems. They propose two pragmatic strategies—prompt engineering and downstream fine-tuning—showing that prompts can modestly improve ANLS while fine-tuning on automatically labeled DT-VQA data yields substantial gains, especially for open models. Additionally, they introduce AccANLS, a metric designed to balance recognition errors and output length, providing a more robust evaluation for dense-text VQA. The work highlights the need for specialized dense-text benchmarks and data-driven training approaches to advance practical information extraction from text-rich images.
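For readers unfamiliar with ANLS, the following is a minimal sketch of the standard Average Normalized Levenshtein Similarity used in text-centric VQA benchmarks. It is not the paper's AccANLS, whose definition is not given in this summary. The function names, the 0.5 threshold, and the lowercase/strip normalization are assumptions taken from common practice, not from the authors' code.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls_single(prediction: str, references: Sequence[str],
                threshold: float = 0.5) -> float:
    """Score one answer against its accepted references (best match wins)."""
    pred = prediction.strip().lower()  # assumed normalization
    best = 0.0
    for ref in references:
        gt = ref.strip().lower()
        longest = max(len(pred), len(gt))
        nl = levenshtein(pred, gt) / longest if longest else 0.0
        # Answers that are too far from the reference score zero.
        score = 1.0 - nl if nl < threshold else 0.0
        best = max(best, score)
    return best


def anls(predictions: List[str], references: List[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average the per-question scores over the whole evaluation set."""
    scores = [anls_single(p, r, threshold)
              for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, a one-character recognition error such as "helo" against the reference "hello" still earns partial credit, while a long sentence that merely contains the correct answer scores zero because the normalized distance exceeds the threshold. This shows how both recognition errors and output length affect the score.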

Abstract

While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information and make better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 4 tables.

Introduction
Related work
Text-oriented Datasets
Large Multi-modal Models
DT-VQA Dataset
Images
Annotation
Statistics and Analysis
Benchmark
Evaluation Metric
Baseline
Strategy
Result
Quantitative Result
Qualitative Result
...and 2 more sections

Figures (6)

Figure 1: Visualization of VQA errors of LMMs on dense text images.
Figure 2: Question-answer pair generation pipeline. (Note: Only the question-answer pairs from HierText and POIE are generated with OCR information as input, which is the gray part of the figure.)
Figure 3: Statistics and analysis of DT-VQA.
Figure 4: (a) Downstream fine-tuning. (b) Prompt engineering.
Figure 5: Visualization of LMM responses before and after prompt engineering and downstream supervised fine-tuning, where (a)(b)(c) correspond to prompt engineering and (d)(e)(f) correspond to downstream supervised fine-tuning.
...and 1 more figure
