ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding

Xiao Wang; Qingyi Si; Jianlong Wu; Shiyu Zhu; Li Cao; Liqiang Nie

TL;DR

ReTaKe tackles long-video understanding in VideoLLMs by jointly reducing temporal and knowledge redundancy without additional training. It introduces DPSelect, which selects keyframes at peaks of inter-frame distance, and PivotKV, which treats those keyframes as pivots and prunes low-attention tokens from the KV cache of the remaining frames. Together they let a VideoLLM process up to 8× more frames (up to 2048) within a fixed memory budget. ReTaKe outperforms similar-sized models by 3-5% and rivals much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench. Because compression is overlapped with prefilling, it adds only about 10% prefilling latency while reducing decoding latency by about 20%.

Abstract

Video Large Language Models (VideoLLMs) have made significant strides in video understanding but struggle with long videos due to the limitations of their backbone LLMs. Existing solutions rely on length extrapolation, which is memory-constrained, or visual token compression, which primarily leverages low-level temporal redundancy while overlooking the more effective high-level knowledge redundancy. To address this, we propose ReTaKe, a training-free method with two novel modules, DPSelect and PivotKV, that jointly reduce temporal visual redundancy and knowledge redundancy for video compression. To align with human temporal perception, DPSelect identifies keyframes based on inter-frame distance peaks. To leverage LLMs' learned prior knowledge, PivotKV marks the keyframes as pivots and compresses non-pivot frames by pruning low-attention tokens in their KV cache. ReTaKe enables VideoLLMs to process 8 times more frames (up to 2048), outperforming similar-sized models by 3-5% and even rivaling much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench. Moreover, by overlapping compression operations with prefilling, ReTaKe introduces only ~10% prefilling latency overhead while reducing decoding latency by ~20%. Our code is available at https://github.com/SCZwangxiao/video-ReTaKe.
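
The keyframe selection idea can be illustrated with a small sketch. It assumes each frame is already summarized by a non-zero feature vector, measures the distance between consecutive frames as cosine dissimilarity (the measure named in the Figure 3 caption), and treats the frame that follows each strict local maximum of that distance as a keyframe. The function names and the strict-peak rule are illustrative assumptions, not the authors' implementation, which may add windowing or a selection budget.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_keyframes(frame_features):
    """Pick frames that follow a peak in consecutive-frame distance.

    Hypothetical helper: dist[i] is the distance between frame i and
    frame i + 1, and a peak at i marks frame i + 1 as a keyframe.
    """
    dist = [
        cosine_dissimilarity(frame_features[i], frame_features[i + 1])
        for i in range(len(frame_features) - 1)
    ]
    keyframes = []
    for i, d in enumerate(dist):
        # Missing neighbours at the ends are treated as -infinity.
        left = dist[i - 1] if i > 0 else -math.inf
        right = dist[i + 1] if i + 1 < len(dist) else -math.inf
        if d > left and d > right:
            keyframes.append(i + 1)
    return keyframes


# Toy clip: the content changes once, between frame 1 and frame 2.
features = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
print(select_keyframes(features))
```

In the toy clip the only distance peak lies between frames 1 and 2, so frame 2 is the single keyframe selected.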

Paper Structure (25 sections, 9 equations, 6 figures, 5 tables, 1 algorithm)

This paper contains 25 sections, 9 equations, 6 figures, 5 tables, 1 algorithm.

Introduction
Related Work
Video Large Language Models
Long Video Understanding
Method
Overview
Preliminaries: Chunked Prefill
DPSelect
PivotKV
Efficiency Optimization
Experiments
Benchmarks and Implementations
Implementation Details.
Main Results
Comparison with SoTAs.
...and 10 more sections

Figures (6)

Figure 1: ReTaKe effectively compresses the video sequence in VideoLLMs, allowing longer perception and improved performance within a fixed memory budget (measured by context length).
Figure 2: Illustration of ReTaKe. DPSelect selects keyframes. The video sequence is then processed chunk by chunk, during which PivotKV compresses the KV cache of video tokens.
Figure 3: An example of DPSelect. Distance represents cosine dissimilarity between the $i$-th and $(i+1)$-th frames.
Figure 4: Efficiency optimization through overlapping compression operations with prefilling. $\text{S}_1, \text{S}_2$ represent different CUDA streams. $F_i^l, C_i^l$ denote prefilling and compression operations, respectively, for chunk $i$ in the $l$-th layer.
Figure 5: (a) Ablation study on DPSelect and PivotKV under different compression ratios. (b) Trade-off between knowledge and temporal redundancy: ReTaKe favors leveraging knowledge redundancy.
...and 1 more figures
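
The KV-cache compression step described in the abstract and in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions: each video token in a chunk has a single scalar attention score (how scores are aggregated across heads and queries is not specified here), tokens of pivot frames are always kept, and a hypothetical `keep_ratio` parameter sets the fraction of non-pivot tokens that survive. None of the names below come from the released code.

```python
def pivot_kv_keep_indices(attn_scores, is_pivot, keep_ratio):
    """Return sorted indices of tokens to keep in a chunk's KV cache.

    attn_scores: one (assumed, pre-aggregated) attention score per token.
    is_pivot:    True for tokens that belong to a keyframe (pivot).
    keep_ratio:  hypothetical fraction of non-pivot tokens to retain.
    """
    pivot = [i for i, p in enumerate(is_pivot) if p]
    non_pivot = [i for i, p in enumerate(is_pivot) if not p]
    budget = int(len(non_pivot) * keep_ratio)
    # Rank non-pivot tokens by attention, highest first, and keep the top ones.
    ranked = sorted(non_pivot, key=lambda i: attn_scores[i], reverse=True)
    return sorted(set(pivot) | set(ranked[:budget]))


def compress_kv(keys, values, keep):
    """Drop every cached key/value entry whose index is not in `keep`."""
    return [keys[i] for i in keep], [values[i] for i in keep]


scores = [0.9, 0.1, 0.05, 0.4, 0.3, 0.02]
pivots = [True, True, False, False, False, False]
keep = pivot_kv_keep_indices(scores, pivots, keep_ratio=0.5)
keys, values = compress_kv(list("abcdef"), list("ABCDEF"), keep)
print(keep, keys, values)
```

In this toy chunk the two pivot tokens survive regardless of their scores, and the two highest-scoring non-pivot tokens (indices 3 and 4) fill the remaining budget.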
