MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen
TL;DR
MARC tackles the high computational cost of long-context video understanding with a retrieve-then-compress framework: a Visual Memory Retriever (VMR) selects semantically coherent event fragments, and a post-training compression policy, C-GRPO, distills the reasoning of a 64-frame teacher into a 1-frame student. The method cuts visual tokens by about 95% (down to 4.71% of the original), GPU memory by 72%, and latency by 23.9%, while preserving near-baseline accuracy across six benchmarks. Key innovations include event-based video segmentation for structured memory, memory-aware temporal compression that respects event boundaries, and a reinforcement-learning-based distillation objective that uses a retention reward to preserve the teacher's answer quality under compression. The result is a practical, real-time-capable approach for resource-constrained video understanding tasks such as video QA, surveillance, and autonomous driving.
Abstract
The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
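
The abstract does not spell out the C-GRPO objective. As background only, the sketch below shows the group-relative advantage step that standard GRPO uses: several responses are sampled for the same prompt, each receives a scalar reward, and the rewards are normalized within that group. The function name, the example rewards, the `eps` constant, and the use of the population standard deviation are all illustrative assumptions, not the authors' implementation, and how MARC's retention reward enters the scalar reward is not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Hypothetical helper: standard GRPO-style advantage
    A_i = (r_i - mean(r)) / (std(r) + eps). Population std is an
    assumption made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four sampled answers to one video question,
# rewarded 1.0 when correct and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get positive advantages and those below get negative ones, so no separate value network is needed to provide a baseline. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.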
