MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen

TL;DR

MARC tackles the high computational cost of long-context video understanding with a retrieve-then-compress framework. A Visual Memory Retriever (VMR) selects semantically coherent event fragments, and a post-training compression policy, C-GRPO, distills the reasoning of a 64-frame teacher into a 1-frame student. The method reduces visual tokens by about 95% (down to 4.71% of the original), with a 72% memory saving and a 23.9% latency reduction, while preserving near-baseline accuracy across six benchmarks. Key innovations include event-based video segmentation for structured memory, memory-aware temporal compression that respects event boundaries, and a reinforcement-learning-based distillation objective that uses a retention reward to preserve the teacher's reasoning quality under compression. This yields a practical, real-time-capable approach for resource-constrained video understanding tasks such as video QA, surveillance, and autonomous driving.

Abstract

The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.

Paper Structure

This paper contains 14 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of the Visual Memory Retriever (VMR). In the first stage, the original video is processed along two paths: (1) a visual memory retriever searches the video, a new video is reconstructed from the search results, and its visual features are compressed; (2) the original video is sampled and fed into a visual encoder to obtain uncompressed visual features.
  • Figure 2: Overview of Compression Group Relative Policy Optimization (C-GRPO). Here, O denotes the outputs from different groups, R the corresponding rewards, and A the normalized advantages. A compression reward $r_c$ is introduced to encourage compressed inputs to retain the reasoning ability of the uncompressed teacher model.
  • Figure 3: Distribution of benchmarks based on the number of QA samples.
  • Figure 4: Distribution of training dataset based on number of QA samples.
  • Figure 5: Vision tokens for each benchmark and MARC compared with the baseline performance.
  • ...and 1 more figure
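
The Figure 2 caption names per-output rewards R, normalized advantages A, and a compression reward $r_c$. As a rough illustration only, the sketch below computes group-normalized advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation). How $r_c$ is actually combined with the task reward is not stated in this summary. The additive form, the `weight` parameter, the use of the population standard deviation, and the function name are all assumptions made for the example, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(task_rewards, compression_rewards, weight=1.0, eps=1e-8):
    """Return one normalized advantage per sampled output in a group.

    Hypothetical helper: the total reward is assumed to be
    task_reward + weight * compression_reward (r_c).
    """
    if len(task_rewards) != len(compression_rewards):
        raise ValueError("each output needs both a task reward and a compression reward")

    # Assumed additive combination of the task reward and r_c.
    totals = [r + weight * rc for r, rc in zip(task_rewards, compression_rewards)]

    # GRPO-style normalization within the group; eps avoids division by zero
    # when every output in the group receives the same total reward.
    mu = mean(totals)
    sigma = pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


# Four sampled outputs for one prompt, with made-up reward values.
advantages = group_relative_advantages(
    task_rewards=[1.0, 0.0, 1.0, 0.0],
    compression_rewards=[0.5, 0.0, 0.0, 0.5],
)
print([round(a, 3) for a in advantages])
```

Under these assumptions, outputs that are both correct and consistent with the uncompressed teacher receive the largest advantage, and the advantages within a group sum to approximately zero.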