A Large-Scale Study on Video Action Dataset Condensation

Yang Chen, Sheng Guo, Bo Zheng, Limin Wang

TL;DR

The paper tackles the problem of condensing large-scale video action datasets by extending three representative condensation approaches to the space-time domain, alongside a unified evaluation protocol. It introduces temporal processing with sliding-window sampling and analyzes labeling, augmentation, and loss choices, revealing that labeling methods often dominate performance while temporal design shapes consistency and efficiency. The study presents comprehensive ablations across four action datasets (HMDB51, UCF101, SSv2, K400), showing that dataset distillation methods excel on harder datasets while sample selection can perform well on easier ones, and achieves state-of-the-art results under the proposed protocol. The work enables data-efficient video action recognition at scale and provides practical guidance on algorithm choice and evaluation for future video condensation research.

Abstract

Recently, dataset condensation has made significant progress in the image domain. Unlike images, videos possess an additional temporal dimension, which harbors considerable redundant information, making condensation even more crucial. However, video dataset condensation remains an underexplored area. We aim to bridge this gap by providing a large-scale study with systematic design and fair comparison. Specifically, our work delves into three key aspects to provide valuable empirical insights: (1) temporal processing of video data, (2) the evaluation protocol for video dataset condensation, and (3) adaptation of condensation algorithms to the space-time domain. From this study, we derive several intriguing observations: (i) labeling methods greatly influence condensation performance, (ii) simple sliding-window sampling is effective for temporal processing, and (iii) dataset distillation methods perform better in challenging scenarios, while sample selection methods excel in easier ones. Furthermore, we propose a unified evaluation protocol for the fair comparison of different condensation algorithms and achieve state-of-the-art results on four widely used action recognition datasets: HMDB51, UCF101, SSv2, and K400. Our code is available at https://github.com/MCG-NJU/Video-DC.

Paper Structure

This paper contains 20 sections, 5 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Pipeline of our study. Our study includes three core elements: temporal processing, condensation algorithms, and evaluation settings. Temporal processing is the design specific to video data and includes sampling and interpolation. Real videos pass through a condensation algorithm (categorized as sample selection or dataset distillation, with G indicating the chosen approach) to form the synthetic dataset (left). The synthetic videos, combined with a labeling method, then train an evaluation network, whose performance on the validation set serves as the evaluation metric (right). Both phases rely on a neural network.
  • Figure 2: Temporal processing encompasses sampling and interpolation. (a) Naive sampling directly treats the video as the sampled clip. (b) Segment sampling views videos as independent clips. (c) Sliding-window sampling sequentially samples clips along time. Different interpolation methods can then be applied to generate the final input clips.
  • Figure 3: Conceptual visualization of three dataset condensation frameworks applied to video. Trajectory Matching (a) and Distribution Matching (b), in the blue boxes, are dataset distillation methods, and Score Selection (c), in the green box, is a sample selection method. Dataset distillation defines a proxy task to help condense the dataset and uses the resulting proxy loss (left) to update the condensed dataset, while sample selection directly draws samples from the real dataset to form the condensed one. Trajectory Matching (a) uses a normalized L2 loss to measure the distance between parameters, Distribution Matching (b) matches the distribution between model layers, and Score Selection (c) scores each sample from the real dataset and concatenates the selected samples to form the condensed dataset.
  • Figure 4: Performance curve of various condensation algorithms with different labeling methods on UCF101, illustrating the comparison among condensation algorithms and highlighting the significant impact of labeling methods.
  • Figure 5: Visualization of RDED on UCF101 with IPC=1.
  • ...and 1 more figures
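
The three sampling strategies named in the Figure 2 caption can be sketched as frame-index generators. This is a minimal illustration rather than the authors' implementation: the function names, the `clip_len` and `stride` parameters, and the choice to drop incomplete trailing clips are all assumptions made for the example, and the interpolation step is omitted.

```python
from typing import List


def naive_sampling(num_frames: int) -> List[List[int]]:
    """Treat the whole video as a single clip (assumed reading of 'naive')."""
    return [list(range(num_frames))]


def segment_sampling(num_frames: int, clip_len: int) -> List[List[int]]:
    """Split the video into independent, non-overlapping clips.

    Trailing frames that do not fill a complete clip are dropped
    (an assumption made for this sketch).
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    starts = range(0, num_frames - clip_len + 1, clip_len)
    return [list(range(s, s + clip_len)) for s in starts]


def sliding_window_sampling(
    num_frames: int, clip_len: int, stride: int = 1
) -> List[List[int]]:
    """Sample overlapping clips by moving a window along the time axis.

    `stride` is a hypothetical parameter; stride == clip_len reduces
    to segment sampling.
    """
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    starts = range(0, num_frames - clip_len + 1, stride)
    return [list(range(s, s + clip_len)) for s in starts]


if __name__ == "__main__":
    # Frame indices for a toy 6-frame video and 4-frame clips.
    for clip in sliding_window_sampling(num_frames=6, clip_len=4, stride=1):
        print(clip)
```

With a stride smaller than the clip length, consecutive clips share frames, so one video yields more clips than segment sampling does for the same clip length.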