Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement

Ziyu Wang; Yue Xu; Cewu Lu; Yong-Lu Li

TL;DR

This paper addresses the challenge of video dataset distillation by systematically studying temporal condensation and proposing a taxonomy across four factors: the number of synthetic frames $N_{syn}$, the number of real frames $N_{real}$, the number of segments $K$, and the interpolation algorithm $\mathcal{I}$. It reveals that temporal information is often underutilized in distillation and that dense temporal data yields diminishing returns, motivating a static-dynamic disentanglement: first distill static memory from still frames, then compensate motion with a learnable dynamic memory block $\mathcal{H}$. The authors demonstrate state-of-the-art performance on multiple video benchmarks while using substantially reduced storage (often under $50\%$ of the baseline), and show that their approach generalizes across architectures and scales. This work offers a practical route to memory-efficient video distillation and provides a foundation for further exploration of temporal condensation strategies in large-scale video datasets.
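The taxonomy's segment count $K$ and interpolation algorithm $\mathcal{I}$ can be illustrated with a small sketch. It assumes linear interpolation as one possible choice of $\mathcal{I}$ and an even split into segments. The function names `split_segments` and `interpolate_frames` are hypothetical, and each frame is represented as a flat list of pixel values for brevity. This is not the authors' implementation.

```python
def split_segments(n_frames, k):
    """Split frame indices 0..n_frames-1 into k contiguous (start, stop) segments.

    Assumption: segments are as even as integer division allows.
    """
    if k <= 0 or n_frames < k:
        raise ValueError("need 1 <= k <= n_frames")
    bounds = [i * n_frames // k for i in range(k + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(k)]


def interpolate_frames(frames, target_len):
    """Linearly interpolate N_syn synthetic frames up to target_len frames.

    Assumption: linear interpolation stands in for the interpolation
    algorithm I; other choices are possible.
    """
    n_syn = len(frames)
    if n_syn == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    if n_syn == 1 or target_len == 1:
        # A single still image is simply repeated along the time axis.
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        pos = t * (n_syn - 1) / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, n_syn - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out
```

With `n_syn == 1` the function reduces to repeating one still image, which corresponds to the most aggressive level of temporal condensation described above.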

Abstract

Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, distillation for videos, which carry an additional temporal dimension, remains underexplored. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that temporal information is usually not well learned during distillation, and that the temporal dimension of the synthetic data contributes little. These observations motivate our unified framework, which disentangles the dynamic and static information in videos. It first distills the videos into still images as static memory and then compensates for the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art performance on video datasets at different scales, with a notably smaller memory storage budget. Our code is available at https://github.com/yuz1wan/video_distillation.
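The two-stage idea of static memory plus dynamic compensation can be pictured with a toy example. The abstract does not specify how the learnable dynamic memory block combines with the static memory, so the sketch below assumes a plain per-frame additive residual purely for illustration. The name `compose_video` is hypothetical, frames are flat lists of pixel values, and nothing here is learned.

```python
def compose_video(static_frame, dynamic_memory):
    """Rebuild a clip from one still image and per-frame residuals.

    Assumption: frame_t = static_frame + dynamic_memory[t]. The paper's
    dynamic memory block is learnable; simple addition is only a stand-in.
    """
    for residual in dynamic_memory:
        if len(residual) != len(static_frame):
            raise ValueError("each residual must match the static frame size")
    return [
        [s + d for s, d in zip(static_frame, residual)]
        for residual in dynamic_memory
    ]
```

When every residual is zero the clip is just the still image repeated, which matches the static-only first stage described in the abstract.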

Paper Structure (36 sections, 13 figures, 12 tables, 1 algorithm)


Introduction
Related work
Pre-analysis
Preliminaries
Segmented Matching and Interpolation
Comparison of Temporal Compression
Methodology
Static Learning
Dynamic Fine-tuning
Experiments
Datasets and Metrics
Baselines
Implementation Details
Results
Ablation Study
...and 21 more sections

Figures (13)

Figure 1: (a) Naive video distillation methods simply match the training dynamics (gradient, feature, trajectory, etc.) of the real and synthetic videos. (b) To exploit the temporal redundancy of videos, we propose a paradigm with segmented matching and interpolation techniques to cover all levels of temporal condensation. (c) Based on this paradigm and our observations, we propose an approach of efficient static frame distillation and motion compensation, with better efficiency and performance.
Figure 2: The distillation setting in (a) obeys temporal consistency, while (b) and (c) violate the two consistency preconditions.
Figure 3: Left: Different types of video distillation that obey temporal consistency. Right: A basic framework for compressed video distillation, which condenses the temporal dimension by distillation, and interpolates the synthetic frames to the target length.
Figure 4: Examples with different compression levels. (a) uses an image distillation algorithm to distill the frames one by one. (b) distills the video into a single image.
Figure 5: Comparison of model performance (a) and efficiency (b, c) for different numbers of independent synthetic frames $N_{syn}$ and real frames $N_{real}$, with DM and ConvNet+GRU.
...and 8 more figures

Theorems & Definitions (1)

Definition 1
