CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

Shrenik Patel, Daivik Patel

TL;DR

The paper addresses the challenge of long-form video understanding with vision-language models whose attention and KV caches scale poorly over long videos. It introduces CacheFlow, a training-free framework that combines Dynamic Token Dropping to prune redundant patches with a GRU-based compressive memory to store compact summaries of past context, enabling efficient live streaming VQA. A consensus-based retrieval mechanism rehydrates only the most relevant memory blocks, allowing the model to attend to both recent and retrieved context with bounded computation. Across offline and streaming benchmarks, CacheFlow matches or surpasses strong baselines while dramatically reducing token usage and latency, demonstrating practical, scalable long-form video understanding without fine-tuning.

Abstract

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% fewer tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
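
As a rough illustration of the Dynamic Token Dropping step described in the abstract, the sketch below drops a patch when its cosine similarity to the same patch position in the previous frame is high, then packs the surviving tokens into blocks. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`cosine`, `drop_tokens`, `pack_blocks`), the threshold of 0.9, the position-wise patch comparison, and the handling of a final partial block are all hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def drop_tokens(prev_frame, frame, threshold=0.9):
    """Return indices of patches in `frame` that survive pruning.

    Assumption: a patch is redundant when its similarity to the patch at the
    same position in the previous frame is >= `threshold` (hypothetical value).
    The first frame has no predecessor, so every patch is kept.
    """
    if prev_frame is None:
        return list(range(len(frame)))
    return [
        i for i, (prev_patch, patch) in enumerate(zip(prev_frame, frame))
        if cosine(prev_patch, patch) < threshold
    ]

def pack_blocks(tokens, block_size):
    """Group surviving tokens into blocks of `block_size`.

    Assumption: the trailing block may be shorter; a streaming system would
    more likely hold it in a buffer until it fills.
    """
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]

# Toy usage: three 2-D "patch embeddings" per frame.
frame0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame1 = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
kept0 = drop_tokens(None, frame0)    # first frame: all patches kept
kept1 = drop_tokens(frame0, frame1)  # only the patch that changed survives
```

In this toy input only the middle patch of the second frame changes direction, so it is the only one that survives pruning; the other two are near-duplicates of the previous frame and are dropped.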

Paper Structure

This paper contains 34 sections, 9 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: CacheFlow system overview. Dynamic Token Dropping prunes redundant tokens via inter-frame similarity, and surviving tokens are packed into fixed-size blocks. Each block is summarized by a GRU-based compressive memory, while full key–value pairs are offloaded. During inference, consensus-first retrieval rehydrates only the Top-$K$ relevant blocks for efficient long-range reasoning.
  • Figure 2: Qualitative example from MLVU. Visualizes how Dynamic Token Dropping (DTD) preserves only the salient regions (green overlays) corresponding to the suspect interacting with the vehicle. Note the first frame is fully preserved by default.
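
The retrieval step named in the Figure 1 caption (rehydrating only the Top-K relevant blocks) could be sketched as follows. The source does not specify how the consensus is computed, so this sketch substitutes a simple voting scheme: each query vector votes for its own top-k blocks by dot product with the block summaries, and blocks are ranked by vote count with summed score as a tie-break. The names `top_k_blocks` and `rehydrate`, the plain-list summaries (standing in for the GRU-produced ones), and the dictionary used as the offloaded KV store are all hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_blocks(summaries, queries, k):
    """Pick k block indices by a voting-based consensus (an assumed stand-in).

    summaries: one summary vector per stored block.
    queries:   several query vectors (for example, one per head or layer).
    Returns the selected block indices in temporal (ascending) order.
    """
    n = len(summaries)
    votes = [0] * n
    totals = [0.0] * n
    for q in queries:
        scores = [dot(q, s) for s in summaries]
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        for i in ranked[:k]:
            votes[i] += 1              # this query "votes" for its top-k blocks
        for i, score in enumerate(scores):
            totals[i] += score         # summed score is used only to break ties
    order = sorted(range(n), key=lambda i: (votes[i], totals[i]), reverse=True)
    return sorted(order[:k])

def rehydrate(kv_store, block_ids):
    """Fetch the full KV pairs of the selected blocks from the offload store."""
    return [kv_store[i] for i in block_ids]

# Toy usage: four block summaries, two query vectors, retrieve two blocks.
summaries = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
queries = [[1.0, 0.0], [0.9, 0.1]]
kv_store = {i: f"kv-block-{i}" for i in range(len(summaries))}
selected = top_k_blocks(summaries, queries, k=2)
retrieved = rehydrate(kv_store, selected)
```

Only the small summary vectors are scanned at query time; the full KV pairs are touched only for the selected blocks, which is what keeps the attended context bounded.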