Latent Video Dataset Distillation

Ning Li, Antai Andy Liu, Jingran Zhang, Justin Cui

TL;DR

This work addresses the inefficiency of video dataset distillation by shifting distillation to latent space using a state-of-the-art VAE. It combines Diversity-Aware Data Selection via Determinantal Point Processes, training-free latent compression with High-Order Singular Value Decomposition, and two-stage VAE quantization to maximize storage efficiency while preserving temporal dynamics. The approach delivers state-of-the-art results across MiniUCF, HMDB51, Kinetics-400, and SSv2 under multiple IPC budgets, with substantial gains over pixel-space methods. By reducing memory usage and expediting distillation without retraining, the method offers a scalable pathway for efficient video dataset condensation in practical settings.

Abstract

Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets. For example, we achieve a 2.6% performance increase on HMDB51 at IPC 1 and a 7.8% performance increase on MiniUCF at IPC 5. Our code is available at https://github.com/liningresearch/Latent_Video_Dataset_Distillation.
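To make the diversity-aware selection step concrete, the following is a minimal sketch of greedy subset selection under a Determinantal Point Process. It is an illustration only, not the authors' implementation: the cosine-similarity kernel, the greedy log-determinant search, and the names `greedy_dpp_select`, `features`, and `eps` are all assumptions introduced here. Each row of `features` stands in for a flattened latent of one video.

```python
import numpy as np


def greedy_dpp_select(features: np.ndarray, k: int, eps: float = 1e-6) -> list[int]:
    """Greedily pick k row indices that approximately maximize det(L_S).

    Assumption: L is a cosine-similarity kernel over the rows of `features`
    (rows must be non-zero). The paper may use a different kernel.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # eps * I keeps every principal submatrix positive definite.
    kernel = unit @ unit.T + eps * np.eye(len(features))

    selected: list[int] = []
    for _ in range(k):
        best_idx, best_logdet = -1, -np.inf
        for i in range(len(features)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        selected.append(best_idx)
    return selected


# Toy usage: rows 0 and 1 are near-duplicates, row 2 points elsewhere.
toy_features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
toy_selection = greedy_dpp_select(toy_features, k=2)
```

Because the determinant of a kernel submatrix shrinks when two selected items are similar, the toy selection keeps row 2 and only one of the two near-duplicate rows. This brute-force search recomputes a determinant per candidate and is only meant for small pools.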

Paper Structure

This paper contains 27 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Our training-free latent video distillation pipeline. The entire video dataset is encoded into latent space with a VAE. We then apply Determinantal Point Processes (DPPs) to select samples that are both representative and diverse, followed by latent-space compression with HOSVD for efficient storage.
  • Figure 2: Comparison of different dataset distillation methods and data sampling methods on SSv2 at IPC 1.
  • Figure 3: Accuracies of HMDB51 (IPC 1) and MiniUCF (IPC 1) under different rank compression ratios utilized in HOSVD.
  • Figure 4: Inter-frame comparison between DM and our method. Our frames are reconstructed from saved tensors and decoded by a 3D-VAE.
  • Figure 5: Architecture of the Variational Autoencoder (VAE).
  • ...and 1 more figures
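
The HOSVD compression step referenced in Figures 1 and 3 can be sketched as a truncated higher-order SVD of a single latent tensor. This is a hedged illustration rather than the paper's code: the 4-D latent layout (channels, frames, height, width), the chosen ranks, and the helper names `mode_product`, `hosvd_compress`, and `hosvd_reconstruct` are assumptions made for the example.

```python
import numpy as np


def mode_product(tensor: np.ndarray, matrix: np.ndarray, mode: int) -> np.ndarray:
    """Multiply `tensor` along axis `mode` by `matrix` of shape (new_dim, old_dim)."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)


def hosvd_compress(tensor: np.ndarray, ranks: tuple[int, ...]):
    """Return a truncated core tensor and one factor matrix per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Mode-n unfolding: axis `mode` becomes the rows, all other axes the columns.
        unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u[:, :rank])  # keep the leading left singular vectors
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto the kept subspace
    return core, factors


def hosvd_reconstruct(core: np.ndarray, factors: list) -> np.ndarray:
    """Expand the core back to the original shape."""
    tensor = core
    for mode, u in enumerate(factors):
        tensor = mode_product(tensor, u, mode)
    return tensor


# Toy usage on a hypothetical latent of shape (channels, frames, height, width).
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 6, 5, 5))
core, factors = hosvd_compress(latent, ranks=(2, 3, 3, 3))
approx = hosvd_reconstruct(core, factors)

# Stored values after compression versus the raw latent size.
stored = core.size + sum(u.size for u in factors)
```

Only the core and the factor matrices need to be saved, so lower ranks trade reconstruction error for storage. That is the knob a rank compression ratio would control in this sketch.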