TRINS: Towards Multimodal Language Models that Can Read

Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun

TL;DR

TRINS addresses a gap in multimodal language models: the ability to read text within images, which remains underexplored because of training data biases. The paper introduces TRINS, a dataset of 39,153 text-rich images with 102,437 QA pairs and long captions, collected via a semi-automatic pipeline using CLIP and GPT-4, along with the TRINS-Cap and TRINS-VQA variants. It also proposes LaRA, a lightweight architecture that fuses a low-resolution visual encoder with an OCR component and a capable decoder; it is trained on ~90k VQA and ~158k instruction examples while the visual encoder stays frozen. Empirically, LaRA achieves state-of-the-art results on TRINS-VQA and TRINS-Cap and maintains or improves performance on standard visual benchmarks, indicating that targeted text-rich data and OCR integration can significantly improve reading comprehension in multimodal models. This work suggests a practical path to robust on-image text understanding, with broad applicability in instruction tuning and multimodal generation.

Abstract

Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to limitations of the training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of multimodal large language models. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that annotations in TRINS contain significantly more words than those of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called a Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset, as well as on other classical benchmarks. Lastly, we conducted a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.

Paper Structure (29 sections, 18 figures, 11 tables)


Figures (18)

  • Figure 1: Overview of the TRINS data collection process, which yields three datasets. Text-rich images are first selected from web images, and annotators are then asked to describe each image in detail. i) TRINS-Cap is extracted from human annotations with heuristic data processing for text-rich image captioning tasks. ii) TRINS-VQA is built upon human annotations, with question-answer pairs for training generated by prompting text-only LLMs. iii) TRINS-Gen combines human annotations and text boxes for text-rich image generation.
  • Figure 2: CLIP-based categorization of our collected images and selected representative data samples from each category.
  • Figure 3: Word clouds of (a) predicted tags and (b) detected words from the text-rich images of TRINS.
  • Figure 4: OCR word (a), Caption (b), Question (c) and Answer (d) statistics for TRINS.
  • Figure 5: Question type statistics based on keywords.
  • ...and 13 more figures
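
The keyword-based question typing mentioned for Figure 5 can be illustrated with a minimal sketch. The keyword list, the function names, and the sample questions below are all hypothetical and chosen only for illustration; the paper's actual keyword set and counting procedure are not given in this summary.

```python
from collections import Counter

# Hypothetical keyword list (an assumption, not taken from the paper).
QUESTION_KEYWORDS = ("what", "who", "where", "when", "which", "why", "how")


def question_type(question: str) -> str:
    """Return the first listed keyword that appears as a word, else 'other'."""
    # Lowercase and drop question marks so "What" and "book?" tokenize cleanly.
    words = question.lower().replace("?", " ").split()
    for word in words:
        if word in QUESTION_KEYWORDS:
            return word
    return "other"


def question_type_counts(questions):
    """Tally question types over an iterable of question strings."""
    return Counter(question_type(q) for q in questions)


# Made-up sample questions, not drawn from TRINS.
sample = [
    "What is the title of the book?",
    "Who is the author shown on the cover?",
    "In which year was the poster printed?",
    "Is the sign written in red?",
]
counts = question_type_counts(sample)
```

Matching on the first keyword in word order, rather than only on the first word, lets questions such as "In which year ..." fall under "which"; yes/no questions that contain no listed keyword end up in "other".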