InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang

TL;DR

InfiniPot-V tackles the KV-cache memory bottleneck in streaming video understanding (SVU) with a training-free, query-agnostic framework for continual KV cache compression. It combines Temporal-axis Redundancy (TaR) and Value Norm (VaN) to prune temporally redundant tokens and preserve semantically salient ones under a fixed memory budget, enabling on-device SVU without retraining. Across multiple open-source MLLMs and six long-video benchmarks, InfiniPot-V reduces peak memory by up to 94% while sustaining real-time generation and matching or exceeding full-cache accuracy, including in challenging multi-turn dialogues. Removing the KV-cache bottleneck in this way makes on-device streaming video assistants practical on edge devices and in other memory-constrained environments.

Abstract

Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via the Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
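
The threshold-triggered compression loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `value_norm`, `compress`, and `run_stream`, the random key/value vectors, and the specific `threshold` and `budget` values are all assumptions made for illustration, and only a VaN-style ranking (L2 norm of each value vector) is shown; the TaR step is omitted. The point is that the peak cache size depends on the threshold, not on the stream length.

```python
import math
import random

def value_norm(value):
    """L2 norm of a value vector, used here as a VaN-style saliency score."""
    return math.sqrt(sum(x * x for x in value))

def compress(cache, budget):
    """Keep the `budget` entries with the largest value norm, in original order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(range(len(cache)),
                    key=lambda i: value_norm(cache[i][1]),
                    reverse=True)
    keep = sorted(ranked[:budget])  # restore temporal order of the survivors
    return [cache[i] for i in keep]

def run_stream(num_frames, tokens_per_frame, threshold, budget, dim=4, seed=0):
    """Feed synthetic frames into a cache and compress whenever it hits `threshold`.

    Returns (final cache length, peak cache length). All sizes are hypothetical.
    """
    rng = random.Random(seed)
    cache, peak = [], 0
    for _ in range(num_frames):
        # Each frame appends `tokens_per_frame` synthetic (key, value) pairs.
        for _ in range(tokens_per_frame):
            key = [rng.gauss(0, 1) for _ in range(dim)]
            value = [rng.gauss(0, 1) for _ in range(dim)]
            cache.append((key, value))
        peak = max(peak, len(cache))
        # Compression pass: triggered only once the user-set threshold is reached.
        if len(cache) >= threshold:
            cache = compress(cache, budget)
    return len(cache), peak

if __name__ == "__main__":
    for frames in (10, 100, 1000):
        print(frames, run_stream(frames, tokens_per_frame=4, threshold=16, budget=8))
```

In this sketch the peak is the same for a 10-frame and a 1000-frame stream, which is the length-independent cap the abstract refers to; a cache that is only compressed after a full prefill would instead peak at the full stream length.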

Paper Structure

This paper contains 58 sections, 11 equations, 7 figures, 13 tables, 2 algorithms.

Figures (7)

  • Figure 1: MLLMs Video Understanding and Compression. (a) OVU pipeline; (b) IVC: compresses vision tokens after encoding; (c) KVC: compresses KV cache after prefill; (d) CKV: iteratively processes and compresses KV caches to constrain memory usage; (e) Accuracy vs. GPU memory consumption for compression across four token reduction ratios (50%, 25%, 12.5%, 6.25%) on MLVU using Qwen-2-VL-7B. LongVU [shen2024longvu] is used for IVC, SnapKV [li2024snapkv] for KVC; (f) GPU memory usage as input video stream length increases. IVC/KVC/CKV target a 6K cache; Sampling uses 1/4 of input frames. Measured on A100 80GB single GPU.
  • Figure 2: Spatio-Temporal KV cache Compression (TaR and VaN). (a) Temporal redundancy across adjacent frames, showing static patches that can be evicted from past frames; (b) Layer-wise cosine similarity of Key/Value embeddings for static patches between consecutive frames in LLaVA-Next-Video-7B; (c) InfiniPot-V performs query-agnostic spatiotemporal compression, reducing temporal redundancy with TaR and selecting tokens via VaN spatial scoring.
  • Figure 3: Value Norm (VaN) Analysis. (a) Entropy analysis of vision token representations grouped by their VaN scores. (b) VideoMME performance under varying cache compression ratios using either VaN or reverse-VaN for token selection. (c) Layer-wise locality of VaN, measured by center distance and coefficient of variation (CV); lower values indicate stronger spatial consistency. LLaVA-Next-7B with Video-MME used.
  • Figure 4: KV cache Compression (KVC) methods evaluation results with offline long video understanding tasks under Continual KV Cache Compression (CKV) framework. Performance across four compression ratios (1/16, 1/8, 1/4, 1/2) for LLaVA-Next-7B (top row) and Qwen-2-VL-7B (bottom row) on VideoMME, MLVU$_\text{dev}$, and LongVideoBench$_\text{dev}$ (LVB$_\text{dev}$) tasks. The full evaluation results are shown in Table \ref{tab:full_table}.
  • Figure A1: Qualitative Results of Multi-Turn Conversation: Full-KV uses 16K cache while InfiniPot-V and SnapKV employ 3K compressed KV cache. SnapKV performs query-guided cache compression based on Q1 before proceeding with multi-turn conversation. The video sample is from the MLVU ego reasoning subtask, using the Qwen-2-VL-7B model. 128 frame sampling is used.
  • ...and 2 more figures
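
The temporal-redundancy idea in the Figure 2 caption (static patches whose Key embeddings stay highly similar across consecutive frames can be evicted from past frames) can be sketched as follows. This is an illustrative approximation rather than the paper's TaR formula, which is not given in this summary: the function names `cosine`, `tar_scores`, and `keep_least_redundant`, and the choice to compare each patch only with the same patch position in the next frame, are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def tar_scores(prev_keys, next_keys):
    """Per-patch redundancy: similarity of the same patch position in two
    consecutive frames. Higher means the patch is more static."""
    return [cosine(p, n) for p, n in zip(prev_keys, next_keys)]

def keep_least_redundant(prev_keys, next_keys, keep):
    """Indices of the `keep` patches in the past frame that changed the most,
    i.e. the ones with the lowest cross-frame similarity."""
    scores = tar_scores(prev_keys, next_keys)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:keep])

if __name__ == "__main__":
    # Three patches with 2-dimensional toy Key embeddings (hypothetical values).
    prev_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    next_keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.5]]
    print(tar_scores(prev_keys, next_keys))
    print(keep_least_redundant(prev_keys, next_keys, keep=2))
```

In the toy input, patch 0 is identical in both frames and is therefore the first candidate for eviction from the past frame, while patches 1 and 2 changed and are retained.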