
From Text to Pixel: Advancing Long-Context Understanding in MLLMs

Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, William Yang Wang

TL;DR

SEEKER tackles the bottleneck of long-context multimodal understanding by encoding long textual content into compact visual tokens, enabling a fixed token budget to support extended reasoning over multiple images and text. The approach combines long-context multi-image encoding with dense image-text alignment and instruction-tuned supervision, resulting in state-of-the-art performance on six long-context multimodal tasks and competitive results on general benchmarks. Key contributions include a novel image-token encoding scheme with explicit image separators, dense rendering of text into images for improved alignment, and a scalable fine-tuning pipeline that enables long-form input and output without excessive computation. The work demonstrates both improved extrapolation over OCR-based methods and substantial inference-time efficiency, suggesting practical impact for long-document understanding, multi-image reasoning, and video QA in real-world settings.

Abstract

The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual contexts remains a challenge due to the inherent limitation of the models' capacity to handle long input sequences efficiently. In this paper, we introduce SEEKER, a multimodal large language model designed to tackle this issue. SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images, enabling the model to handle long text within a fixed token-length budget efficiently. Our empirical experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach, and is more efficient in understanding long-form multimodal input and generating long-form textual output, outperforming all existing proprietary and open-source MLLMs by large margins.
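The fixed-budget argument in the abstract can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only and is not the authors' pipeline: it paginates text the way a renderer might before drawing it onto images, then compares a hypothetical image-token cost against a rough text-token estimate. Every name and default here (`chars_per_line`, `lines_per_image`, `tokens_per_image`, `separator_tokens`, `chars_per_token`) is an assumption made for the example. Only the 8192 context length appears in the source, in the Figure 4 caption.

```python
import math
import textwrap


def paginate_text(text, chars_per_line=80, lines_per_image=40):
    """Wrap text into lines, then group the lines into pages.

    Each page stands for one rendered image. Both layout parameters are
    hypothetical; a real renderer would derive them from font and image size.
    """
    lines = textwrap.wrap(text, width=chars_per_line)
    return [lines[i:i + lines_per_image]
            for i in range(0, len(lines), lines_per_image)]


def image_token_cost(num_images, tokens_per_image=256, separator_tokens=1):
    """Tokens consumed by num_images images, each followed by a separator.

    tokens_per_image is an assumed constant, not a figure from the paper.
    """
    return num_images * (tokens_per_image + separator_tokens)


def rough_text_token_cost(text, chars_per_token=4.0):
    """Crude text-token estimate using an assumed characters-per-token ratio."""
    return math.ceil(len(text) / chars_per_token)


def fits_budget(cost, budget=8192):
    """Check a token cost against a fixed context budget."""
    return cost <= budget


if __name__ == "__main__":
    document = "word " * 10000  # synthetic stand-in for a long document
    pages = paginate_text(document)
    as_images = image_token_cost(len(pages))
    as_text = rough_text_token_cost(document)
    print("pages:", len(pages))
    print("image tokens:", as_images, "fits:", fits_budget(as_images))
    print("text tokens:", as_text, "fits:", fits_budget(as_text))
```

Under these assumed constants the image-token cost grows with the number of rendered pages rather than with the character count, which is the intuition behind the compactness claim. The actual ratio depends on the vision encoder, the rendering density, and the tokenizer, none of which this sketch models.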

Paper Structure (39 sections, 3 equations, 15 figures, 5 tables)

This paper contains 39 sections, 3 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Left: Performance plot on the First-Sentence-Retrieval task, revealing the compact nature of image tokens in representing long content. Right: Radar chart demonstrating the superior performance of the SEEKER (ours) model across both short- and long-context multimodal tasks.
  • Figure 2: The long multimodal context task mainly consists of two elements: 1) a long image sequence and text input and 2) a long text output.
  • Figure 3: Seeker surpasses OCR-based models on long multimodal context tasks: 1) it processes multiple text-rich images naturally; 2) its tokens are more compact and fit easily within a fixed-context-length LLM.
  • Figure 4: Density plot comparing token counts for image-token (blue) and OCR-text (orange) representations. Image tokens are more compact than text, fitting well within an 8192-token context length.
  • Figure 5: Generation times for Seeker and Seeker-Tiny with and without OCR.
  • ...and 10 more figures