InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Minsoo Kim; Kyuhong Shim; Jungwook Choi; Simyung Chang

TL;DR

InfiniPot-V tackles the KV cache memory bottleneck in streaming video understanding (SVU) with a training-free, query-agnostic framework for continual KV cache compression. It combines a Temporal-axis Redundancy (TaR) metric, which prunes temporally redundant tokens, with Value-Norm (VaN) ranking, which preserves semantically salient ones, all under a fixed memory budget and without retraining. Across four open-source MLLMs and six long-video and streaming-video benchmarks, InfiniPot-V reduces peak memory by up to 94% while matching or exceeding full-cache accuracy and sustaining real-time generation, including in challenging multi-turn dialogues. Removing the KV cache bottleneck in this way makes on-device streaming video assistants practical on edge devices and other memory-constrained environments.

Abstract

Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
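
The threshold-triggered compression loop described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not define the TaR and VaN scores, so the sketch assumes that TaR can be approximated by the cosine similarity between a token's key and the key at the same spatial position in the newest frame (low similarity means less redundant), and that VaN is the L2 norm of the value vector. The names `Token`, `compress`, and `stream`, the even split of the budget between the two criteria, and the choice to always keep the newest frame are all hypothetical.

```python
import math
import random
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    frame: int          # index of the video frame the token came from
    pos: int            # spatial position of the token inside its frame
    key: List[float]    # toy stand-in for the cached key vector
    value: List[float]  # toy stand-in for the cached value vector


def l2_norm(vec: List[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))


def cosine(a: List[float], b: List[float]) -> float:
    denom = l2_norm(a) * l2_norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def compress(cache: List[Token], budget: int) -> List[Token]:
    """Shrink the cache to at most `budget` tokens (assumed scoring rules)."""
    latest = max(t.frame for t in cache)
    recent = [t for t in cache if t.frame == latest]  # assumption: always kept
    older = [t for t in cache if t.frame != latest]
    remaining = budget - len(recent)
    if remaining <= 0:
        return recent[:budget]
    ref_keys = {t.pos: t.key for t in recent}
    # TaR-like step (assumed): keep older tokens least similar to the newest frame.
    by_novelty = sorted(older, key=lambda t: cosine(t.key, ref_keys[t.pos]))
    n_tar = remaining // 2
    kept = by_novelty[:n_tar]
    # VaN-like step (assumed): fill the rest with the largest value norms.
    by_norm = sorted(by_novelty[n_tar:], key=lambda t: l2_norm(t.value), reverse=True)
    kept += by_norm[:remaining - n_tar]
    # Restore temporal order so the cache still reads as a stream.
    return sorted(kept + recent, key=lambda t: (t.frame, t.pos))


def stream(num_frames: int, tokens_per_frame: int, dim: int,
           threshold: int, budget: int, seed: int = 0) -> Tuple[List[Token], int]:
    """Encode random toy frames, compressing whenever the threshold is reached."""
    rng = random.Random(seed)
    cache: List[Token] = []
    peak = 0
    for frame in range(num_frames):
        for pos in range(tokens_per_frame):
            key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            value = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            cache.append(Token(frame, pos, key, value))
        peak = max(peak, len(cache))
        if len(cache) >= threshold:
            cache = compress(cache, budget)
    return cache, peak


if __name__ == "__main__":
    final_cache, peak_tokens = stream(num_frames=100, tokens_per_frame=4,
                                      dim=8, threshold=32, budget=16)
    print(len(final_cache), peak_tokens)
```

In this toy setting the peak cache size is bounded by the threshold plus at most one frame of tokens, regardless of how many frames are streamed, which is the length-independent behavior the abstract describes.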
