Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Alex Jinpeng Wang; Linjie Li; Yiqi Lin; Min Li; Lijuan Wang; Mike Zheng Shou

TL;DR

This work tackles the high computational cost of extending in-context text in multimodal LLMs by introducing Visualized In-Context Text Processing (VisInContext), which renders long text as images that are processed by a lightweight vision encoder. By combining Token Masking with a Text-Centric Contrastive Loss (TCCL), VisInContext aligns rendered-text representations with conventional text embeddings, effectively creating a visual text tokenizer and enabling far longer in-context text without a proportional increase in FLOPs. Empirically, extending the in-context length from 256 to 2048 tokens yields gains on multimodal few-shot benchmarks, and the approach improves document understanding tasks such as DocVQA and OCR-VQA while remaining efficient and enabling sequential multimodal retrieval. The method is shown to be compatible with existing MLLM architectures and shows potential for broader document understanding applications, although it is currently limited by fixed image sizes, with dynamic rendering left to future work.
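To make the alignment idea concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss between rendered-text embeddings and text embeddings. It is an illustration only, not the paper's TCCL formulation: the function name `contrastive_loss`, the `temperature` value, and the use of cosine similarity are all assumptions that do not come from this summary.

```python
import math


def _normalize(vec):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (hypothetical stand-in for TCCL).

    visual[i] is the embedding of a rendered-text image and text[i] is the
    embedding of the same passage as ordinary text; other rows act as negatives.
    """
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    # Cosine similarity between every visual/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (
        _diagonal_cross_entropy(logits)
        + _diagonal_cross_entropy(logits_transposed)
    )
```

With matched pairs the loss approaches zero, while shuffling the text rows so that each rendered passage is paired with the wrong text makes it large, which is the pressure that pulls the two representations of the same passage together.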

Abstract

Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces a method designed to efficiently increase in-context text length in multimodal large language models (MLLMs). We present Visualized In-Context Text Processing (VisInContext), which processes long in-context text using visual tokens. This technique significantly reduces GPU memory usage and floating point operations (FLOPs) in both the training and inference stages. For instance, our method expands the pre-training in-context text length from 256 to 2048 tokens with nearly the same FLOPs for a 56 billion parameter MOE model. Experimental results demonstrate that models trained with VisInContext deliver superior performance on common downstream benchmarks for in-context few-shot evaluation. Additionally, VisInContext is complementary to existing methods for increasing in-context text length and enhances document understanding capabilities, showing great potential in document QA tasks and sequential document retrieval.

Paper Structure (28 sections, 1 equation, 4 figures, 8 tables)

Introduction
Method
Terminology
Overall Architecture
Text Rendering
Token Masking
Text-Centric Contrastive Loss (TCCL)
Motivation.
Mechanism.
Experiment
Experimental Setup
Pretraining.
Downstream Evaluation.
In-context Few-shot Evaluation
Impact of Extended In-Context Text Length.
...and 13 more sections

Figures (4)

Figure 1: VisInContext Pipeline. The VisInContext pipeline builds upon the Flamingo model for in-context few-shot modeling (represented in gray). VisInContext processes interleaved image-text data by rendering portions of the in-context text into images. This approach maintains the Text Token Length of the model while allowing for a significantly extended In-context Text Length.
Figure 2: VisInContext significantly improves the OCR ability of the LLM. We present Rendered Text [renderedtext2023] images and the corresponding next-word prediction accuracy on the validation set. With the same number of pre-training steps, VisInContext achieves significantly better results in predicting words in visual images, even when the fonts are difficult to recognize.
Figure 3: VisInContext extends the in-context text length of an MOE-based MLLM from 1k to 9k at the inference stage.
Figure 4: Sequential multi-modal retrieval example. The input sequence is $I_1,T_1,R_1,I_2,T_2,R_2$, drawn from an interleaved document in the OBELICS [obelics] dataset.
