Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets

Xulin Gu, Xinhao Zhong, Zhixing Wei, Yimin Zhou, Shuoyang Sun, Bin Chen, Hongpeng Wang, Yuan Luo

TL;DR

This paper proposes a novel uni-level video dataset distillation framework that directly optimizes synthetic videos with respect to a pre-trained model and introduces a temporal saliency-guided filtering mechanism that leverages inter-frame differences to guide the distillation process.

Abstract

Dataset distillation (DD) has emerged as a powerful paradigm for dataset compression, enabling the synthesis of compact surrogate datasets that approximate the training utility of large-scale ones. While significant progress has been achieved in distilling image datasets, extending DD to the video domain remains challenging due to the high dimensionality and temporal complexity inherent in video data. Existing video distillation (VD) methods often suffer from excessive computational costs and struggle to preserve temporal dynamics, as naïve extensions of image-based approaches typically lead to degraded performance. In this paper, we propose a novel uni-level video dataset distillation framework that directly optimizes synthetic videos with respect to a pre-trained model. To address temporal redundancy and enhance motion preservation, we introduce a temporal saliency-guided filtering mechanism that leverages inter-frame differences to guide the distillation process, encouraging the retention of informative temporal cues while suppressing frame-level redundancy. Extensive experiments on standard video benchmarks demonstrate that our method achieves state-of-the-art performance, bridging the gap between real and distilled video data and offering a scalable solution for video dataset compression.
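
As a rough illustration of what "leveraging inter-frame differences" could look like, the sketch below scores each frame by its mean absolute difference from the previous frame and normalizes the scores into weights. It is not the paper's implementation: the function names `frame_saliency` and `saliency_weights`, the flat-list frame representation, and the choice of mean absolute difference are all assumptions made for this example.

```python
def frame_saliency(frames):
    """Mean absolute difference between each frame and its predecessor.

    `frames` is a list of equally sized, non-empty frames, each a flat list
    of pixel intensities (an assumed toy representation). The first frame
    has no predecessor, so it gets a score of 0.0.
    """
    scores = [0.0]
    for prev, curr in zip(frames, frames[1:]):
        diff = sum(abs(c - p) for p, c in zip(prev, curr)) / len(curr)
        scores.append(diff)
    return scores


def saliency_weights(scores, eps=1e-8):
    """Normalize scores into weights that sum to 1.

    Falls back to uniform weights when the clip contains no motion.
    """
    total = sum(scores)
    if total < eps:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


# Toy 4-frame "video" with 2 pixels per frame: static, one jump, static again.
video = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
scores = frame_saliency(video)
weights = saliency_weights(scores)
```

In this toy clip only the third frame differs from its predecessor, so it receives all of the weight. A distillation loss could use such weights to emphasize frames that carry motion and down-weight redundant ones, which is the general idea the abstract describes.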

Paper Structure

This paper contains 26 sections, 10 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of test accuracy and distillation cost between different methods across all the IPC settings. Our method exhibits superior performance.
  • Figure 2: Comparison between our TSGF framework and the traditional two-stage video distillation paradigm. While both stages of the traditional approach primarily rely on pixel-level information, our unified framework effectively distills temporal information through a three-stage process. TSGF comprises two key components: TSGF$_O$ and TSGF$_A$. During the optimization stage, TSGF$_O$ constrains the optimization by computing inter-frame differences. In the evaluation stage, TSGF$_A$ guides the data augmentation process to preserve temporal dynamics.
  • Figure 3: Experimental results on MiniUCF under IPC=1 with varying numbers of frames.
  • Figure 4: Experimental results on MiniUCF under IPC=1 with varying numbers of frames.
  • Figure 5: Optical flows of MiniUCF.
  • ...and 1 more figure