LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Ruiyi Zhang, Yufan Zhou, Jian Chen, Jiuxiang Gu, Changyou Chen, Tong Sun

TL;DR

LLaVA-Read tackles the challenge of reading dense textual content embedded in images by introducing a dual visual encoder setup plus a lightweight, OCR-based visual-text encoder. It employs layout-aware pretraining and fine-tuning to align features across the encoders and the LLM, with tasks for text recognition, text localization, page parsing, and layout recovery, followed by instruction-following fine-tuning. Empirical results show state-of-the-art performance on OCRBench among open-source models, and substantial gains on chart and document-centric VQA when layout information and higher-resolution encoders are included. The work demonstrates that separating visual object understanding from visual text extraction, and grounding the extracted text with layout information, yields significant improvements in the reading ability of multimodal LLMs. This has practical implications for document analysis and accessibility; the authors also acknowledge OCR-related limitations and resource considerations.
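To make the idea of grounding extracted text with layout information concrete, here is a minimal sketch of how OCR words and their bounding boxes could be serialized into a text sequence for a language model. It is not the paper's OCR tokenizer: the `OcrWord` type, the `serialize_ocr` function, the `<x0,y0,x1,y1>` suffix format, the 100-bin coordinate quantization, and the row-bucketing heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR detection: the text plus its pixel bounding box (hypothetical type)."""
    text: str
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge


def serialize_ocr(words, width, height, bins=100, line_height=10):
    """Serialize OCR words into layout-aware text, one word per line.

    Assumed format (illustrative only): "<text> <x0,y0,x1,y1>", with each
    coordinate quantized to an integer bin in [0, bins - 1] so the result
    does not depend on the image resolution.
    """
    def quantize(value, size):
        # Map a pixel coordinate to a bin index and clamp it to the valid range.
        return min(bins - 1, max(0, int(value * bins / size)))

    # Approximate reading order: bucket words into rows of `line_height`
    # pixels by their top edge, then read each row left to right.
    ordered = sorted(words, key=lambda w: (w.y0 // line_height, w.x0))

    lines = []
    for w in ordered:
        box = ",".join(
            str(v)
            for v in (
                quantize(w.x0, width),
                quantize(w.y0, height),
                quantize(w.x1, width),
                quantize(w.y1, height),
            )
        )
        lines.append(f"{w.text} <{box}>")
    return "\n".join(lines)


# Example: three words detected on a 500x500 page, given out of reading order.
words = [
    OcrWord("Total", 50, 400, 150, 430),
    OcrWord("Invoice", 50, 20, 200, 60),
    OcrWord("42.00", 300, 402, 380, 430),
]
prompt = serialize_ocr(words, width=500, height=500)
print(prompt)
```

Keeping coordinates as a small set of quantized integers is one simple way to expose layout to a text-only tokenizer; a real system might instead use dedicated location tokens or learned position embeddings.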

Abstract

Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle to comprehend dense textual content embedded within images, primarily because of limited text recognition and layout understanding abilities. To understand the sources of these limitations, we perform an exploratory analysis showing the drawbacks of classical visual encoders on visual text understanding. Hence, we present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models in various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our findings suggest that visual text understanding remains an open challenge and that an efficient visual text encoder is crucial for future successful multimodal systems.

Paper Structure (36 sections, 14 figures, 8 tables)

This paper contains 36 sections, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Model overview of LLaVA-Read, a multimodal LLM with dual visual encoders to handle both visual objects and texts. Given a text-rich image, the visual-text encoder extracts texts and their location information and feeds them to the OCR tokenizer. The ViT-based low-resolution encoder (e.g., 336$\times$336) focuses on global visual information, and the convolution-based high-resolution encoder (e.g., 768$\times$768) focuses on visual details. The high-resolution encoder merges its information into the low-resolution encoder, as not all details are useful in answering a question.
  • Figure 2: Comparison of word recognition accuracy among different methods using (a) multiple font sizes against a plain background, (b) multiple font sizes against a natural image background, and (c) varying word counts.
  • Figure 3: An example that showcases complex reasoning in infographics. It shows LLaVA-Read can comprehend both visual texts and objects within a sophisticated layout.
  • Figure 4: Different lengths of dense text against a plain background.
  • Figure 5: Different font sizes against a natural image background.
  • ...and 9 more figures