V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu

TL;DR

The paper addresses the difficulty Vision-Language Models (VLMs) have with long multimodal sequences. It shows that applying text-style positional encoding to visual tokens is suboptimal and that performance degrades sharply once position indices exceed the model's context window. It introduces Variable Visual Position Encoding (V2PE), which uses smaller, variable increments for visual tokens and a dynamic delta during training to better manage long sequences. Using augmented datasets (Long-VQA, Long-MR, MM-NIAH1M) to fine-tune InternVL2-2B with V2PE, the authors report strong gains on long-context benchmarks and competitive results on standard tasks, and the resulting model can process sequences of up to 1M tokens. The work offers a practical pathway for extending long-context capabilities in open-source VLMs and highlights the importance of modality-specific position encoding for long multimodal reasoning.

Abstract

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate that V2PE enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM, InternVL2. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications.
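
The core idea of "variable and smaller increments for visual tokens" can be illustrated with a minimal sketch. It assumes a simple rule in which the first token gets index 0, each later textual token advances the position index by 1, and each later visual token advances it by a fraction `delta`. The function names, the boolean mask input, and the candidate set `DELTA_CANDIDATES` are hypothetical choices made for illustration; they are not taken from the authors' code.

```python
import random
from fractions import Fraction
from typing import List, Sequence

# Hypothetical candidate increments (1, 1/2, ..., 1/256); the set actually
# used during training is an assumption here.
DELTA_CANDIDATES = [Fraction(1, 2 ** k) for k in range(0, 9)]


def assign_positions(is_visual: Sequence[bool], delta: Fraction) -> List[Fraction]:
    """Assign a position index to each token.

    The first token gets index 0. Each later token advances the index by 1
    if it is textual and by `delta` (<= 1) if it is visual.
    """
    positions: List[Fraction] = []
    current = Fraction(0)
    for i, visual in enumerate(is_visual):
        if i > 0:
            current += delta if visual else 1
        positions.append(current)
    return positions


def sample_delta(rng: random.Random) -> Fraction:
    """Pick one increment per training sample (a sketch of the 'dynamic delta' idea)."""
    return rng.choice(DELTA_CANDIDATES)


# Two text tokens, four visual tokens, one text token.
mask = [False, False, True, True, True, True, False]
standard = assign_positions(mask, Fraction(1))       # same step for all tokens
compressed = assign_positions(mask, Fraction(1, 4))  # smaller step for visual tokens
print([float(p) for p in standard])    # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print([float(p) for p in compressed])  # [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 3.0]
```

With `delta = 1/4`, the four visual tokens consume one unit of position range instead of four, so the same sequence ends at index 3 rather than 6. Exact fractions are used only to keep the printed values free of floating-point noise.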

Paper Structure

This paper contains 20 sections, 6 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Performance on the MM-NIAH image retrieval task with token lengths of up to 1M across different VLMs. The GPT-4o version is 2024-08-06, while the InternVL2-2B models are both fine-tuned from the official release, using our augmented data, with token lengths reaching up to 256K per sample.
  • Figure 2: Illustration of our proposed Variable Visual Position Encoding (V2PE). Unlike the standard position encoding used in most VLMs, which applies the same positional increment to both visual and textual tokens, V2PE uses smaller, variable positional increments for visual tokens than for textual tokens. This flexible design enables VLMs to handle long-context multimodal sequences within a limited context window.
  • Figure 3: Statistics of our mixed training dataset, including Long-MR, Long-VQA, and short SFT data from InternVL2. The left part and the right part illustrate the distribution of tokens per sample and images per sample, respectively.
  • Figure 4: Performance on Long-VQA (long) with sequence lengths up to 32K tokens and on the corresponding standard VQA (short) benchmarks. Compared with GPT-4o and InternVL2-FT-Short, InternVL2-FT-32K, enhanced with our proposed V2PE, achieves similarly strong performance on both the standard VQA benchmarks and their long-context counterparts.
  • Figure 5: Performance on the image retrieval task in MM-NIAH (left) and the QA task in Long-VQA (right) with different positional increments. Additionally, we include the results of InternVL2-FT-32K when using the linear interpolation (linear) and NTK-Aware Scaled RoPE (NTK) position encoding extension methods.
  • ...and 11 more figures
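
To make the "limited context window" point in the Figure 2 caption concrete, here is a rough worked calculation. It assumes a rule in which each textual token advances the position index by 1 and each visual token advances it by $\delta \le 1$. The symbols $N_t$ (number of textual tokens), $N_v$ (number of visual tokens), $p_{\max}$ (largest position index), and all numbers are illustrative assumptions, not values reported in the paper.

```latex
\begin{align*}
p_{\max} &\approx N_t + \delta\, N_v \\[4pt]
\text{Illustrative case: } & N_t = 40{,}000,\quad N_v = 960{,}000 \\[4pt]
\delta = 1:\quad            & p_{\max} \approx 40{,}000 + 960{,}000 = 1{,}000{,}000 \\
\delta = \tfrac{1}{16}:\quad & p_{\max} \approx 40{,}000 + \tfrac{960{,}000}{16} = 100{,}000
\end{align*}
```

Under these assumed numbers, a sequence of 1M tokens occupies a position range of about 100K when visual tokens use the smaller increment, which is how a reduced step for visual tokens can keep position indices inside a window far shorter than the raw token count.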