HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu

TL;DR

Vision-Language Models struggle with long-context inputs due to limitations in extending Rotary Position Embedding (RoPE) to multimodal, spatial-temporal data. The paper provides a theoretical analysis showing that vanilla RoPE distorts 3D locality and that existing multimodal RoPE frequency allocations cannot preserve semantic preference over long contexts. It then introduces HoPE, a Hybrid of Position Embedding that uses a Hybrid Frequency Allocation with zeroed temporal frequencies and a Dynamic Temporal Scaling mechanism to enable robust learning across varying video speeds and context lengths. Empirical results across four long-video benchmarks show HoPE consistently outperforms baselines, with notable gains in long video retrieval (22.23%) and understanding (8.35%), validating its effectiveness for long-context vision-language modeling.

Abstract

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Our code is available at https://github.com/hrlics/HoPE.

Paper Structure

This paper contains 23 sections, 5 theorems, 26 equations, 5 figures, and 4 tables.

Key Result

Proposition 3.1

Given any query $\mathbf{q}$ at position $(t,x,y)$ and a relative distance of 1 in spatial or temporal dimensions, the flattening operation in 1D RoPE distorts the relative distance with a magnitude dependent on the frame resolution.
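
The distortion described in Proposition 3.1 can be illustrated with a small sketch. It assumes visual tokens are flattened in row-major order (frame, then row, then column); the paper summary does not state the ordering, so `flatten_index` and the grid sizes below are hypothetical. The 12x12 grid is chosen only because it yields 144 tokens per frame, the figure quoted for V-NIAH; the actual grid shape is an assumption.

```python
def flatten_index(t, x, y, height, width):
    """1D index of the token at frame t, row y, column x.

    Assumes row-major flattening (an illustrative choice, not confirmed
    by the source).
    """
    return t * height * width + y * width + x


def flattened_distances(height, width):
    """1D index gap produced by a unit step along each 3D axis."""
    origin = flatten_index(0, 0, 0, height, width)
    return {
        "x": flatten_index(0, 1, 0, height, width) - origin,
        "y": flatten_index(0, 0, 1, height, width) - origin,
        "t": flatten_index(1, 0, 0, height, width) - origin,
    }


# Hypothetical frame resolutions, in tokens per side.
for height, width in [(12, 12), (24, 24)]:
    print((height, width), flattened_distances(height, width))
```

Under this assumed layout, a unit step along `y` becomes a 1D gap of `width` and a unit step along `t` becomes a gap of `height * width`, so the distortion grows with the frame resolution while the true 3D distance stays at 1.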

Figures (5)

  • Figure 1: Comparison of our HoPE and existing methods. Upper plots illustrate the frequency allocation strategies in different RoPE variants. Here, frequency decreases along the diagonal. (d) HoPE sets the lowest frequencies to zero for reliable long-range semantic modeling. Lower plots demonstrate different temporal scaling mechanisms. (d) HoPE proposes dynamic and bidirectional scaling to learn temporal dynamics at multiple scales, facilitating robustness to various video speeds.
  • Figure 2: Multimodal RoPEs use different frequencies for temporal modeling. M-RoPE uses the highest frequencies, which are suboptimal for long-context modeling. VideoRoPE utilizes the lowest frequencies for more stable semantic modeling. Our HoPE, employing zero frequencies for temporal modeling, establishes the upper bound of semantic modeling capabilities across all strategies.
  • Figure 3: Performance comparison on long video retrieval task (V-NIAH). Here, each frame corresponds to 144 tokens. Cell colors indicate model accuracy (red: low, green: high), and the black dotted line marks the training context length (8k).
  • Figure 4: Ablation results on Video-MME from 8k to 64k. Here, HFA: hybrid frequency allocation, DTS: dynamic temporal scaling.
  • Figure 5: Illustration of V-NIAH, which consists of a randomly inserted needle image, a haystack video, and a specific question related to the needle.
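
The effect of the zero frequencies mentioned in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the head dimension, the base of 10000, the number of zeroed frequencies, and the helper names (`rope_frequencies`, `zero_lowest`, `pair_score`) are all assumptions. It relies only on the standard RoPE fact that each 2D query/key pair is rotated by an angle equal to position times frequency.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies base^(-2i/dim), ordered highest first."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def zero_lowest(freqs, num_zero):
    """Copy of freqs with the num_zero lowest (last) entries set to 0.

    Which and how many frequencies are zeroed is a hypothetical choice here.
    """
    keep = len(freqs) - num_zero
    return freqs[:keep] + [0.0] * num_zero


def rotate(pair, angle):
    """Rotate a 2D vector counter-clockwise by angle (radians)."""
    a, b = pair
    c, s = math.cos(angle), math.sin(angle)
    return (a * c - b * s, a * s + b * c)


def pair_score(q, k, freq, q_pos, k_pos):
    """Dot product of one query/key pair after RoPE rotation."""
    qr = rotate(q, q_pos * freq)
    kr = rotate(k, k_pos * freq)
    return qr[0] * kr[0] + qr[1] * kr[1]


q, k = (1.0, 0.5), (0.8, 0.3)          # toy query/key pair
freqs = rope_frequencies(dim=8)        # toy head dimension
hybrid = zero_lowest(freqs, num_zero=1)

for distance in (0, 3000, 1_000_000):
    low = pair_score(q, k, freqs[-1], 0, distance)
    zero = pair_score(q, k, hybrid[-1], 0, distance)
    print(distance, round(low, 4), round(zero, 4))
```

With a zero frequency the rotation is the identity, so that pair's contribution stays at the raw dot product of `q` and `k` regardless of temporal distance, whereas any nonzero frequency, however low, eventually rotates the pair and changes the score as the distance grows.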

Theorems & Definitions (9)

  • Proposition 3.1: 1D RoPE violates spatial-temporal locality priors
  • Definition 3.1: Semantic Preference
  • Theorem 3.1
  • Lemma 4.1: Necessary Condition for Semantic Preference
  • Theorem 4.1
  • proof
  • Lemma A.1
  • proof: Proof of Lemma \ref{lem:neg_prob_monotone}
  • proof