Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Geewook Kim; Hodong Lee; Daehee Kim; Haeji Jung; Sanghee Park; Yoonsik Kim; Sangdoo Yun; Taeho Kil; Bado Lee; Seunghyun Park

TL;DR

Cream addresses visually-situated natural language understanding on text-rich images by pairing a vision encoder with auxiliary OCR and object encoders and aligning their features with a contrastive objective. It connects to frozen large language models through fixed-size soft visual prompts, which reduces the number of OCR tokens passed to the LLM and improves efficiency. The model is trained in a unified multitask framework covering text reading (TR), masked text prediction (MTP), captioning, question answering (QA), and question generation (QG), and is evaluated on Document VQA benchmarks, where it shows strong performance and robustness to OCR noise. The work contributes a new architecture, a contrastive training strategy, and open-source datasets and code for visual document understanding and multimodal reasoning with LLMs.

Abstract

Recent advances in Large Language Models (LLMs) have stimulated a surge of research aimed at extending their applications to the visual domain. While these models exhibit promise in generating abstract image captions and facilitating natural conversations, their performance on text-rich images still requires improvement. In this paper, we introduce Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details that are often overlooked in existing methods. Cream combines vision and auxiliary encoders, fortified by a contrastive feature alignment technique, to achieve a more effective comprehension of language information in visually situated contexts within the images. Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants. Through rigorous evaluations across diverse visually-situated language understanding tasks that demand reasoning capabilities, we demonstrate the compelling performance of Cream, positioning it as a prominent model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream .
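To make the idea of contrastive feature alignment concrete, the following is a minimal sketch of a generic InfoNCE-style loss between paired vision and auxiliary feature vectors. It is not taken from the Cream codebase: the function names, the one-directional (vision to auxiliary) formulation, the cosine similarity, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(vision_feats, aux_feats, temperature=0.07):
    """Mean cross-entropy of matching vision_feats[i] to aux_feats[i].

    Hypothetical helper: each vision vector is treated as a query, its
    same-index auxiliary vector as the positive, and all other auxiliary
    vectors as negatives.
    """
    v = [l2_normalize(x) for x in vision_feats]
    a = [l2_normalize(x) for x in aux_feats]
    total = 0.0
    for i, vi in enumerate(v):
        # Cosine similarities scaled by the temperature.
        logits = [sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(v)


# Toy data: aligned pairs should give a much lower loss than swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

In this toy setup the loss is small when each vision vector is closest to its own auxiliary vector and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to encourage.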

Paper Structure (55 sections, 1 equation, 17 figures, 11 tables)

Introduction
Related Work
Applying LLMs to Visually-Situated NLU
Visual Document Understanding
Method
Contrastive Reading Model
Architecture
Vision Encoder
Auxiliary Encoder
Decoder
Contrastive Feature Alignment
Integration of Cream and LLMs
Model Training
Tasks
Text Reading (TR)
...and 40 more sections

Figures (17)

Figure 1: Comparison on a text-rich image. The proposed method, Cream, precisely interprets and reads the relevant store's name from a poster containing multiple text instances and visuals, overcoming limitations of existing approaches (e.g., OCR+ChatGPT). Our Cream efficiently extracts visual features from the image, thus enabling LLMs to provide an appropriate response.
Figure 2: Overview of Cream's Framework. (a) Image patches are fed into the vision encoder, while information extracted from off-the-shelf detectors is processed through the auxiliary encoders if available. The encoded vectors are concatenated and then cross-attended in the decoder. The decoder, receiving both a learned query vector and a user query as inputs, serves as a soft visual prompter for the LLM. Note that the encoders are frozen during the training with LLMs. (b) Encoded vector representations are effectively aligned using a contrastive learning scheme.
Figure 3: Token embeddings in the auxiliary encoder. The 2D positional embeddings are computed using the center point of each bounding box. Text embeddings are obtained through a lookup operation on a subword embedding matrix. For simplicity, words are plotted instead of subwords.
Figure 4: Unified multitask framework. The list of full prompts that we used is available in the appendix.
Figure 5: Examples of synthetic VQA datasets. Examples of other datasets are available in the appendix.
...and 12 more figures
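
The token-embedding scheme described in the Figure 3 caption (a 2D positional embedding derived from each bounding box's center point, plus a text embedding looked up in a subword embedding matrix) can be sketched roughly as follows. Only the use of the box center and the lookup operation come from the caption. The bucketing of coordinates, the separate x and y tables, the summation of the three vectors, and every name and size below are assumptions made for illustration.

```python
import random


def box_center(box):
    """Center (cx, cy) of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2.0, (y0 + y1) / 2.0


def bucket(coord, size, num_buckets):
    """Map a pixel coordinate to a bucket index in [0, num_buckets - 1]."""
    idx = int(coord / size * num_buckets)
    return min(max(idx, 0), num_buckets - 1)


def make_table(rows, dim, rng):
    """Randomly initialized embedding table (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed_token(token_id, box, image_size, text_table, x_table, y_table):
    """Sum the text embedding with x/y positional embeddings (assumed design)."""
    cx, cy = box_center(box)
    width, height = image_size
    xi = bucket(cx, width, len(x_table))
    yi = bucket(cy, height, len(y_table))
    return [
        t + px + py
        for t, px, py in zip(text_table[token_id], x_table[xi], y_table[yi])
    ]


rng = random.Random(0)
DIM, VOCAB, BUCKETS = 4, 10, 8  # hypothetical sizes
text_table = make_table(VOCAB, DIM, rng)
x_table = make_table(BUCKETS, DIM, rng)
y_table = make_table(BUCKETS, DIM, rng)

# One subword (id 3) detected in a box on a 400x400 image.
vec = embed_token(3, (100, 40, 200, 60), (400, 400), text_table, x_table, y_table)
print(len(vec))
```

Using only the center point means two boxes of different sizes with the same center receive the same positional embedding in this sketch; whether box size is encoded elsewhere is not stated in the caption.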
