VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction

Shaobo Wang, Tianle Niu, Runkang Yang, Deshan Liu, Xu He, Zichen Wen, Conghui He, Xuming Hu, Linfeng Zhang

TL;DR

VideoCompressa tackles data efficiency in video understanding by reframing dataset compression as a dynamic latent compression problem. It jointly learns which frames to keep and how to encode them in a compact latent space, using a differentiable Gumbel-Softmax keyframe selector and a frozen VAE encoder, so that the pipeline can be optimized end-to-end with the downstream task. On UCF101 with ConvNets, it surpasses full-data training while using only 0.13% of the original data, and when fine-tuning a multimodal LLM on HMDB51 it matches full-data performance with just 0.41% of the training data. These results indicate that targeting intra-sample temporal redundancy through differentiable frame selection yields large gains in efficiency and generalization, with practical implications for scalable video-language modeling.

Abstract

The scalability of video understanding models is increasingly limited by the prohibitive storage and computational costs of large-scale video datasets. While data synthesis has improved data efficiency in the image domain, its extension to video remains challenging due to pervasive temporal redundancy and complex spatiotemporal dynamics. In this work, we uncover a critical insight: the primary source of inefficiency in video datasets is not inter-sample redundancy, but intra-sample frame-level redundancy. To leverage this insight, we introduce VideoCompressa, a novel framework for video data synthesis that reframes the problem as dynamic latent compression. Specifically, VideoCompressa jointly optimizes a differentiable keyframe selector (implemented as a lightweight ConvNet with Gumbel-Softmax sampling) to identify the most informative frames, and a pretrained, frozen Variational Autoencoder (VAE) to compress these frames into compact, semantically rich latent codes. These latent representations are then fed into a compression network, enabling end-to-end backpropagation. Crucially, the keyframe selector and synthetic latent codes are co-optimized to maximize retention of task-relevant information. Experiments show that our method is highly data-efficient: on UCF101 with ConvNets, VideoCompressa surpasses full-data training by 2.34 percentage points using only 0.13% of the original data, with an over 5800× speedup compared to traditional synthesis methods. Moreover, when fine-tuning Qwen2.5-VL-7B on HMDB51, VideoCompressa matches full-data performance using just 0.41% of the training data, outperforming the zero-shot baseline by 10.61%.
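The sampling step behind a Gumbel-Softmax keyframe selector can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the per-frame scores are made-up numbers standing in for the output of the lightweight ConvNet, the temperature `tau` and the top-k rule for keeping several frames are assumptions, and no gradients are computed here. The names `sample_gumbel` and `select_keyframes` are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse-transform sampling."""
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def select_keyframes(frame_scores, k, tau=0.5, seed=0):
    """Return (soft_weights, hard_indices) for one video.

    soft_weights: softmax((score + Gumbel noise) / tau), the relaxed selection
                  that a differentiable selector would backpropagate through.
    hard_indices: the k frames with the largest perturbed scores
                  (assumed top-k rule, not taken from the paper).
    """
    rng = random.Random(seed)
    perturbed = [(s + sample_gumbel(rng)) / tau for s in frame_scores]

    # Numerically stable softmax over the perturbed scores.
    peak = max(perturbed)
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    soft_weights = [e / total for e in exps]

    # Hard selection: keep the k highest perturbed scores, in temporal order.
    ranked = sorted(range(len(perturbed)), key=lambda i: perturbed[i], reverse=True)
    hard_indices = sorted(ranked[:k])
    return soft_weights, hard_indices


if __name__ == "__main__":
    # Hypothetical informativeness scores for an 8-frame clip.
    scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.1, 1.8, 0.4]
    weights, keep = select_keyframes(scores, k=4)
    print("soft weights:", [round(w, 3) for w in weights])
    print("kept frame indices:", keep)
```

Lower values of `tau` push the soft weights toward a one-hot vector, while higher values spread them more evenly across frames; in an autograd framework the soft weights would carry the gradient back to the frame scorer.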

Paper Structure

This paper contains 20 sections, 2 theorems, 23 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Let $q_1, \dots, q_K$ be a set of logits. Let $g_1, \dots, g_K$ be i.i.d. random variables from the standard Gumbel distribution. Then:
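The statement breaks off at this point. As a hedged reconstruction, the standard form of the Gumbel-Max trick (the name this page gives to Theorem 1) reads as follows in the notation above; the paper's exact wording may differ:

$$
\Pr\!\left[\arg\max_{k \in \{1, \dots, K\}} \left(q_k + g_k\right) = i\right] = \frac{\exp(q_i)}{\sum_{j=1}^{K} \exp(q_j)}, \qquad i = 1, \dots, K.
$$

In words, adding independent standard Gumbel noise to the logits and taking the argmax yields an exact sample from the categorical distribution defined by the softmax of the logits.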

Figures (8)

  • Figure 1: Comparison of performance and computational efficiency across compression methods under multiple compression ratios on the UCF101 dataset. Bubble size indicates peak GPU memory usage during compression. VideoCompressa achieves substantially better performance at dramatically lower computational cost, reaching an over 5800× speedup compared to prior compression methods such as DM, MTT, and LVDD. The dashed orange line marks full-data supervised training.
  • Figure 2: An overview of our proposed pipeline. A raw video is first processed by a Gumbel-based keyframe selection module. The selected keyframes are then encoded into the latent space by a frozen VAE. Finally, these latent representations are optimized via latent space synthesis, guided by the gradients from a student model, to form the synthetic dataset.
  • Figure 3: Classification accuracy (%) for data compression on the HMDB51 and UCF101 video benchmarks under different ratio settings. All methods use a randomly initialized ConvNet and frame scorer to guide reconstruction, with accuracy measured by fine-tuning a pre-trained Qwen2.5-VL-7B on the resulting condensed data.
  • Figure 4: Ablation study of individual components. The figure compares the impact of three frame selection (FS) methods, three VAE variants, and the inclusion of the video understanding module. Experiments are conducted on UCF101 with a 0.13% data ratio.
  • Figure 5: Sensitivity analysis of training steps on HMDB51 and UCF101. All experiments were conducted under a fixed setting: selecting 4 frames with the Gumbel-based strategy.
  • ...and 3 more figures

Theorems & Definitions (5)

  • Definition 1: Data-Efficient Video Understanding
  • Theorem 1: The Gumbel-Max Trick
  • proof
  • Theorem 2: Gradient Approximation Bound
  • proof
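
Theorem 1 in the list above, the Gumbel-Max trick, lends itself to a quick empirical check. The sketch below is not taken from the paper; it assumes the standard statement of the trick (the argmax of logits plus i.i.d. standard Gumbel noise follows the softmax distribution) and compares Monte Carlo frequencies against the softmax probabilities. The logits and the helper names `gumbel_max_sample` and `empirical_frequencies` are illustrative.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    peak = max(logits)
    exps = [math.exp(q - peak) for q in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_max_sample(logits, rng):
    """Return argmax_k (q_k + g_k) with g_k drawn i.i.d. from Gumbel(0, 1)."""
    best_index, best_value = 0, -math.inf
    for index, q in enumerate(logits):
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        value = q - math.log(-math.log(u))
        if value > best_value:
            best_index, best_value = index, value
    return best_index


def empirical_frequencies(logits, num_samples=20000, seed=0):
    """Estimate how often each index wins the perturbed argmax."""
    rng = random.Random(seed)
    counts = Counter(gumbel_max_sample(logits, rng) for _ in range(num_samples))
    return [counts[k] / num_samples for k in range(len(logits))]


if __name__ == "__main__":
    logits = [1.0, 0.0, -1.0]  # illustrative values only
    print("softmax:  ", [round(p, 3) for p in softmax(logits)])
    print("empirical:", [round(f, 3) for f in empirical_frequencies(logits)])
```

With enough samples the two printed vectors should agree to within sampling noise, which is what makes the perturbed argmax usable as a categorical sampler.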