Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification

Minghao Qin, Xiangrui Liu, Zhengyang Liang, Yan Shu, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu

TL;DR

Video-XL-2 addresses the high resource demands of long-video understanding (LVU) with task-aware KV sparsification, which combines chunk-based pre-filling with bi-level KV decoding to reduce FLOPs and memory while matching or surpassing state-of-the-art performance on LVU benchmarks. The model uses a DTS-enabled architecture with a SigLIP vision encoder and a Qwen-2.5-7B LLM, and adds explicit timestamp tokens to improve temporal reasoning. Its four-stage incremental training and efficiency optimizations allow it to process thousands of frames, and over 10,000 frames on a single GPU, with strong results on long-video benchmarks and temporal grounding tasks. The approach offers a practical, scalable option for real-world long-video understanding with a favorable speed-accuracy-efficiency trade-off.

Abstract

Multi-modal large language models (MLLMs) have made significant progress in video understanding over the past few years. However, processing long video inputs remains a major challenge due to high memory and computational costs, which makes it difficult for current models to achieve both strong performance and high efficiency in long-video understanding. To address this challenge, we propose Video-XL-2, a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. The proposed framework operates in two key steps: chunk-based pre-filling and bi-level key-value decoding. Chunk-based pre-filling divides the visual token sequence into chunks, applying full attention within each chunk and sparse attention across chunks, which significantly reduces computational and memory overhead. During decoding, bi-level key-value decoding selectively reloads either dense or sparse key-values for each chunk based on its relevance to the task. This further improves memory efficiency and enhances the model's ability to capture fine-grained information. Video-XL-2 achieves state-of-the-art performance on various long-video understanding benchmarks, outperforming existing open-source lightweight models. It is also highly efficient: it can process over 10,000 frames on a single NVIDIA A100 (80GB) GPU, and thousands of frames in just a few seconds.
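
The chunk-based pre-filling step described in the abstract can be illustrated with a small attention-mask sketch. It is not the authors' implementation. It assumes one timestamp token at the start of each chunk and causal ordering inside a chunk, and the helper names (`build_token_layout`, `chunked_prefill_mask`, `count_attended`) are hypothetical. The cross-chunk rule used here follows the Figure 2 caption: a chunk attends to itself, to historical timestamp tokens, and to the system prompt.

```python
# Illustrative sketch of a chunk-based pre-filling attention mask.
# Assumptions (not from the paper): one timestamp token per chunk,
# causal ordering within a chunk, and all helper names below.

def build_token_layout(num_system, num_chunks, tokens_per_chunk):
    """Tag every token position as (kind, chunk_id)."""
    layout = [("system", -1)] * num_system
    for c in range(num_chunks):
        layout.append(("timestamp", c))                # assumed: one per chunk
        layout.extend([("visual", c)] * tokens_per_chunk)
    return layout


def chunked_prefill_mask(layout):
    """mask[q][k] is True when query position q may attend to key position k."""
    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        _, q_chunk = layout[q]
        for k in range(q + 1):                         # causal: keys up to q only
            k_kind, k_chunk = layout[k]
            if k_kind == "system":                     # system prompt is always visible
                mask[q][k] = True
            elif k_chunk == q_chunk:                   # full attention inside the chunk
                mask[q][k] = True
            elif k_kind == "timestamp":                # sparse link to earlier chunks
                mask[q][k] = True
    return mask


def count_attended(mask):
    """Number of (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


layout = build_token_layout(num_system=2, num_chunks=3, tokens_per_chunk=4)
mask = chunked_prefill_mask(layout)
full_causal = len(layout) * (len(layout) + 1) // 2
print(f"chunked pairs: {count_attended(mask)}, full causal pairs: {full_causal}")
```

Under full causal attention the number of allowed pairs grows quadratically with sequence length. In this sketch the cost within a chunk is bounded by the chunk size, and each additional chunk adds only one extra key (its timestamp token) for later chunks.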

Paper Structure

This paper contains 15 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The architecture of Video-XL-2. Video-XL-2 comprises four main components: (1) a vision encoder that encodes images and videos, (2) a DTS module that compresses the visual features from the vision encoder and performs initial temporal modeling, (3) an MLP projector that maps visual features into the LLM embedding space, and (4) a large language model that processes the multi-modal inputs. Video-XL-2 interleaves timestamp tokens within the visual token sequence to enhance the model’s temporal awareness. Additionally, single-image inputs are repeated four times to align with the video modality.
  • Figure 2: Chunk-based pre-filling. Left: the chunk currently being processed attends only to itself, historical timestamp tokens, and the system prompt. Right: the chunks currently being processed are determined by a sliding chunk window.
  • Figure 3: Bi-level KV decoding. Bi-level KVs comprise dense KVs (derived from the full video input) and sparse KVs, which are obtained by downsampling the dense KVs at the chunk level. During decoding, the model selectively reloads dense KVs for video chunks that are highly relevant to the task query text and sparse KVs for less relevant chunks, reducing memory use while preserving critical information.
  • Figure 4: Efficiency analysis. All results are measured using eager attention for fair comparison.
  • Figure 5: Needle in Haystack Evaluation.
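
The bi-level KV decoding described in the Figure 3 caption can be sketched as a per-chunk selection between dense and sparse entries. This is a minimal illustration, not the paper's method. The relevance scores are given as plain inputs because this page does not say how relevance is computed. Strided downsampling, the top-k selection rule, and the function names `downsample` and `select_bilevel_kv` are all assumptions.

```python
# Illustrative sketch of bi-level KV selection at decoding time.
# Assumptions (not from the paper): strided downsampling, top-k chunk
# selection, precomputed relevance scores, and all names below.

def downsample(chunk_kv, stride):
    """Sparse KVs for one chunk: keep every `stride`-th dense entry."""
    return chunk_kv[::stride]


def select_bilevel_kv(dense_chunks, relevance, top_k, stride):
    """Reload dense KVs for the top_k most relevant chunks, sparse KVs otherwise."""
    ranked = sorted(range(len(dense_chunks)),
                    key=lambda i: relevance[i], reverse=True)
    dense_ids = set(ranked[:top_k])
    return [chunk if i in dense_ids else downsample(chunk, stride)
            for i, chunk in enumerate(dense_chunks)]


# Four chunks of eight placeholder KV entries each, labeled (chunk, position).
dense_chunks = [[(c, t) for t in range(8)] for c in range(4)]
relevance = [0.1, 0.9, 0.3, 0.7]   # hypothetical query-to-chunk relevance scores

reloaded = select_bilevel_kv(dense_chunks, relevance, top_k=2, stride=4)
print([len(chunk) for chunk in reloaded])
print(sum(len(chunk) for chunk in reloaded), "entries reloaded, versus",
      sum(len(chunk) for chunk in dense_chunks), "dense entries")
```

In this sketch the two most relevant chunks keep all of their entries and the others shrink by the stride factor, so the reloaded cache is smaller than the fully dense one while the query-relevant chunks keep their full detail.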