Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang

TL;DR

This paper proposes inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding by modality interaction and mitigates the distance-based penalty on inter-modal attention.

Abstract

Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
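The distance-based penalty described in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's analysis: it assumes plain 1-D RoPE (rather than Multimodal RoPE), the commonly used frequency base of 10000, and an identical all-ones query and key, all of which are illustrative choices. The function name `rope_score` is hypothetical. Under these assumptions the pre-softmax logit depends only on the relative distance and is largest at distance 0.

```python
import math

def rope_score(distance, dim=64, base=10000.0):
    """RoPE logit between an identical all-ones query and key at a relative distance.

    Each 2-D pair (1, 1) has squared norm 2, so after rotation it contributes
    2 * cos(distance * theta_i), with theta_i = base ** (-2 * i / dim).
    """
    return sum(
        2.0 * math.cos(distance * base ** (-2.0 * i / dim))
        for i in range(dim // 2)
    )

# Compare the logit at distance 0 with logits at long-context distances.
for d in (0, 1_000, 8_000, 32_000):
    print(d, round(rope_score(d), 2))
```

The logit oscillates rather than falling monotonically, but in this toy setting every non-zero distance scores below distance 0, which is the kind of bias the abstract attributes to position encoding when long text separates a query from the visual tokens.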

Paper Structure (25 sections, 10 equations, 10 figures, 7 tables, 2 algorithms)


Figures (10)

  • Figure 1: Illustration of the visual fading phenomenon. Left: A qualitative example from DocVQA. While the model accurately grounds visual evidence in a short-context scenario, it fails to preserve correct attention to the image and generates a wrong answer in a long-context one. Right: Quantitative analysis of visual fading. As the inter-modal distance increases, the proportion of attention allocated to visual tokens exhibits a sharp decay, indicating that the model gradually looks away from the image.
  • Figure 2: Overview of inter-modal Distance Invariant Position Encoding (DIPE). DIPE mitigates visual fading by disentangling position encoding based on modality interactions. Specifically, intra-modal attention applies sequential position encoding to both queries and keys to preserve spatial and sequential structures. Conversely, inter-modal attention utilizes anchored position encoding for queries alongside sequential position encoding for keys, effectively anchoring the inter-modal perceptual distance.
  • Figure 3: Accuracy across varying inter-modal distances. We compare the baseline MRoPE against MRoPE + DIPE on 9 benchmarks with distractor lengths ranging from 0K to 32K tokens.
  • Figure 4: Performance on the Short-Context VQA protocol. DIPE serves as a non-destructive enhancement, maintaining the performance on standard VQA benchmarks.
  • Figure 5: Performance on the image reasoning task in MM-NIAH.
  • ...and 5 more figures
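
The query/key rule described in the Figure 2 caption can be sketched for a single query-key pair. This is a hedged illustration, not the authors' implementation (which is in the linked repository): it assumes plain 1-D RoPE instead of Multimodal RoPE, and it assumes the anchored position is one fixed scalar, `anchor_pos`. How the anchor is actually chosen is not stated in this section. The names `rope`, `dot`, `dipe_logit`, and `anchor_pos` are all hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by pos * theta_i (standard 1-D RoPE)."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        angle = pos * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dipe_logit(q, q_pos, q_is_visual, k, k_pos, k_is_visual, anchor_pos):
    """Pre-softmax logit for one query/key pair under the assumed anchoring rule."""
    if q_is_visual == k_is_visual:
        q_rot = rope(q, q_pos)       # intra-modal: query keeps its sequential position
    else:
        q_rot = rope(q, anchor_pos)  # inter-modal: query uses the anchored position (assumption)
    return dot(q_rot, rope(k, k_pos))  # keys always keep their sequential positions

q = [1.0] * 8
k = [1.0] * 8
# Text query attending to a visual key at position 3, with short and long contexts.
inter_near = dipe_logit(q, 10, False, k, 3, True, anchor_pos=5)
inter_far = dipe_logit(q, 30000, False, k, 3, True, anchor_pos=5)
# Text query attending to a text key at position 3 (intra-modal, unchanged RoPE).
intra_near = dipe_logit(q, 10, False, k, 3, False, anchor_pos=5)
intra_far = dipe_logit(q, 30000, False, k, 3, False, anchor_pos=5)
print(inter_near, inter_far, intra_near, intra_far)
```

In this sketch the inter-modal logit is identical for the near and far query because the query's own position no longer enters the computation, while the intra-modal logit still varies with sequential distance. A full attention layer would apply this selection per query-key pair through a modality mask before the softmax.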