
Visual Context Window Extension: A New Perspective for Long Video Understanding

Hongchen Wei, Zhenzhong Chen

TL;DR

The paper tackles the bottleneck of long video understanding in large multimodal models by separating visual and language context windows and extending the visual context window without retraining on long video data. It introduces YaRN-inspired visual context extension and a progressive pooling scheme to control memory usage while preserving essential spatial information. Across VideoMME, MLVU, and LongVideoBench, the approach yields consistent gains and, in the MLVU benchmark, surpasses GPT-4o with a 7B model, while achieving substantial memory reductions at 256 frames. This work enables practical long-video reasoning with existing open-source models and provides a resource-efficient pathway toward scalable long-video understanding.

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
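To make the token-saving idea behind progressive pooling concrete, the following is a minimal sketch of how the visual token count might change when only some frames keep a finer spatial resolution. It is not the authors' implementation. The grouping rule (the first frame of every group of K frames is pooled with the small stride s_h, the rest with the large stride s_l) is an assumed reading of the strategy, the 24x24 patch grid per frame is a hypothetical value, and all function names are illustrative.

```python
from typing import List


def pooled_tokens(grid: int, stride: int) -> int:
    """Tokens left after pooling a grid x grid frame embedding with the given stride."""
    side = grid // stride
    return side * side


def progressive_strides(num_frames: int, k: int, s_h: int, s_l: int) -> List[int]:
    """Assumed rule: first frame of each K-frame group uses s_h, the others use s_l."""
    return [s_h if i % k == 0 else s_l for i in range(num_frames)]


def total_visual_tokens(num_frames: int, grid: int, k: int, s_h: int, s_l: int) -> int:
    """Total visual tokens for a clip under the assumed progressive pooling rule."""
    strides = progressive_strides(num_frames, k, s_h, s_l)
    return sum(pooled_tokens(grid, s) for s in strides)


if __name__ == "__main__":
    # Hypothetical setting: 256 frames, 24x24 patches per frame, K=4, s_h=2, s_l=8.
    progressive = total_visual_tokens(256, 24, k=4, s_h=2, s_l=8)
    # Setting k=1 pools every frame at s_h, i.e. uniform fine-resolution pooling.
    uniform = total_visual_tokens(256, 24, k=1, s_h=2, s_l=8)
    print(progressive, uniform)  # 10944 36864
```

The token counts here are only arithmetic on the assumed grid size; they are not meant to reproduce the memory reduction reported in the abstract, which depends on the actual model and baseline.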

Paper Structure (18 sections, 22 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1:
  • Figure 2:
  • Figure 4: Examples of RoPE embeddings under different context extension methods. Top: RoPE directly extrapolated beyond the pre-training range. Middle: YaRN interpolating and extrapolating different RoPE dimensions beyond the pre-training range. Bottom: Our method further distinguishes between visual and language context windows in YaRN, allowing for different interpolation and extrapolation of RoPE dimensions.
  • Figure 5: Pipeline of progressive pooling strategy.
  • Figure 6: Visualization of the Needle in the Long Video Haystack Experiment, where green represents correct answers, while red indicates incorrect answers. Left: progressive pooling parameters are set to $s_h=2$, $s_l=8$, $K=4$. Right: progressive pooling parameters are set to $s_h=2$, $s_l=4$, $K=4$. Our method enables LMMs, pre-trained on short videos (32 frames), to be extended to 1024 frames without requiring fine-tuning.
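The Figure 4 caption describes YaRN interpolating some RoPE dimensions and extrapolating others. As a rough illustration, the sketch below implements the per-dimension blending used by YaRN (low-frequency dimensions are interpolated by the scale factor, high-frequency dimensions are left unchanged, with a linear ramp in between). Computing the scale factor against a visual context window rather than the language context window is an assumed reading of the paper's idea, not its confirmed formulation. The values for `dim`, `base`, the window sizes, and the function name are hypothetical; `alpha=1` and `beta=32` are the ramp bounds commonly used with YaRN.

```python
import math
from typing import List


def yarn_inv_freq(dim: int, base: float, window: int, scale: float,
                  alpha: float = 1.0, beta: float = 32.0) -> List[float]:
    """YaRN-style rescaled RoPE frequencies, one per pair of hidden dimensions."""
    freqs = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)            # original RoPE frequency
        wavelength = 2 * math.pi / theta
        r = window / wavelength               # rotations completed inside the window
        if r < alpha:
            gamma = 0.0                       # long wavelength: interpolate fully
        elif r > beta:
            gamma = 1.0                       # short wavelength: leave unchanged
        else:
            gamma = (r - alpha) / (beta - alpha)
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs


if __name__ == "__main__":
    # Hypothetical numbers: a model trained on 32 frames of 144 tokens each
    # (visual window of 4608 tokens), extended to 256 frames (36864 tokens).
    visual_window = 32 * 144
    target_length = 256 * 144
    scale = target_length / visual_window     # 8.0, relative to the visual window
    freqs = yarn_inv_freq(dim=128, base=10000.0, window=visual_window, scale=scale)
    print(len(freqs), scale)
```

With a hypothetical 32768-token language window, the same target length would give a scale factor of only 1.125, which illustrates why the choice of reference window matters for how strongly the low-frequency dimensions are interpolated.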