LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
Ruiyi Zhang, Yufan Zhou, Jian Chen, Jiuxiang Gu, Changyou Chen, Tong Sun
TL;DR
LLaVA-Read tackles the challenge of reading dense textual content embedded in images by introducing a dual visual encoder setup plus a lightweight, OCR-based visual-text encoder. It employs layout-aware pretraining and fine-tuning to align features across the encoders and the LLM, using tasks for text recognition, localization, page parsing, and layout recovery, followed by instruction-following fine-tuning. Empirical results show state-of-the-art performance on OCRBench among open-source models, and substantial gains on chart and document-centric VQA when layout information and higher-resolution encoders are included. The work demonstrates that separating visual object understanding from visual text extraction, and grounding the model via layout information, yields significant improvements in reading ability for multimodal LLMs. This has practical implications for document analysis and accessibility, and the authors acknowledge OCR-related limitations and resource considerations.
Abstract
Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle to comprehend dense textual content embedded in images, primarily because of their limited text recognition and layout understanding abilities. To understand the sources of these limitations, we perform an exploratory analysis that shows the drawbacks of classical visual encoders on visual text understanding. We therefore present LLaVA-Read, a multimodal large language model that uses dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models on various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our findings suggest that visual text understanding remains an open challenge and that an efficient visual text encoder is crucial for future successful multimodal systems.
