Condensing Action Segmentation Datasets via Generative Network Inversion

Guodong Ding, Rongyu Chen, Angela Yao

TL;DR

The paper addresses the storage burden of procedural temporal action segmentation (TAS) datasets by introducing a condensation framework that learns a generative prior via the Temporally Coherent Action (TCA) model and uses network inversion to encode segments into compact latent codes. It adds a diversity-based sequence sampling strategy to further reduce redundancy across videos, enabling substantial storage savings (e.g., >500× on Breakfast) while preserving competitive segmentation performance. The approach is validated across multiple TAS benchmarks and backbones, with additional gains demonstrated in incremental TAS settings. Overall, the work offers a practical way to condense TAS data, with implications for efficient training and continual learning in video understanding.

Abstract

This work presents the first condensation approach for procedural video datasets used in temporal action segmentation (TAS). We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes, significantly reducing storage along both the temporal and channel dimensions. Orthogonally, we propose sampling diverse and representative action sequences to minimize video-wise redundancy. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets while achieving competitive performance. Specifically, on the Breakfast dataset, our approach reduces storage by over 500$\times$ while retaining 83% of the performance obtained by training with the full dataset. Furthermore, when applied to a downstream incremental learning task, it yields superior performance compared to the state of the art.
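
To make the "temporal and channel" reduction concrete, the following is a back-of-the-envelope sketch in Python. It assumes each action segment is stored as two latent codes (as the Figure 2 caption describes) and counts stored numbers only. The function name, the 2048-d feature size, the 64-d latent size, and the segment lengths are illustrative assumptions, not values taken from the paper, so the result is not expected to match the reported 500$\times$ figure.

```python
def condensed_storage_ratio(segment_lengths, feature_dim, latent_dim, codes_per_segment=2):
    """Raw feature values divided by stored latent values for one video.

    Counts stored numbers only; labels, segment lengths, and the decoder
    weights are ignored in this simplified estimate.
    """
    original_values = sum(segment_lengths) * feature_dim
    condensed_values = len(segment_lengths) * codes_per_segment * latent_dim
    return original_values / condensed_values


# Hypothetical video: 4 action segments with per-frame 2048-d features,
# each segment condensed into two 64-d latent codes (all numbers assumed).
lengths = [256, 512, 128, 640]
ratio = condensed_storage_ratio(lengths, feature_dim=2048, latent_dim=64)
print(f"{ratio:.0f}x fewer stored values")
```

In this toy setting the ratio factors into a temporal part (1536 frames replaced by 8 codes, a factor of 192) and a channel part (2048 dimensions replaced by 64, a factor of 32). Sampling only a subset of videos would multiply the saving further.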

Paper Structure

This paper contains 15 sections, 14 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of action segmentation performance with dataset storage across common action segmentation benchmarks at different scales. Our method effectively reduces dataset storage while retaining performance competitive with the original setup.
  • Figure 2: Generative Feature and Temporal Condensation Framework. (a) The generative action model is a conditional VAE that is trained to reconstruct the input frames conditioned on the action class label and a coherence variable. (b) The network inversion optimizes the latent codes so that the decoded segment matches the original segment. Randomly sampled latent codes $z_1$ and $z_2$ are first inflated over time to the segment length, then concatenated with the action label and coherence variable for decoding. During the optimization, only the latent codes are updated while the decoder stays fixed. The optimized latent codes $z_1^*$ and $z_2^*$ are stored as the condensed representation of the original segment. Icons in the figure mark which parameters are updated during learning and which are kept frozen.
  • Figure 3: t-SNE visualization of original and decoded video features. Different colors indicate different action classes. The visualization shows that our generated features are well-aligned with original features. Best viewed when zoomed in.
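
The network inversion step described in the Figure 2 caption can be illustrated with a small pure-Python toy. This is a minimal sketch under loudly stated assumptions, not the authors' implementation: the frozen decoder is a single hand-written linear layer standing in for the conditional VAE decoder, "inflating" the two codes is assumed to mean linear interpolation over the segment, the coherence variable is assumed to be normalized time, and the objective is assumed to be a mean squared error. The names `inflate`, `decode`, and `invert` are hypothetical. In practice one would use an autograd framework rather than the manual gradient shown here.

```python
import random


def inflate(z1, z2, length):
    """Stretch two latent codes to `length` frames (assumed: linear interpolation)."""
    frames = []
    for t in range(length):
        a = t / (length - 1) if length > 1 else 0.0
        frames.append(([(1 - a) * p + a * q for p, q in zip(z1, z2)], a))
    return frames  # list of (z_t, a_t) pairs


def decode(W, z_t, cond):
    """Frozen toy decoder: one linear layer applied to [z_t ; cond]."""
    inp = z_t + cond
    return [sum(w * v for w, v in zip(row, inp)) for row in W]


def invert(W, target, label_onehot, latent_dim=2, steps=300, lr=0.5, seed=0):
    """Gradient descent on z1, z2 only; the decoder weights W are never changed."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(latent_dim)]
    z2 = [rng.gauss(0, 1) for _ in range(latent_dim)]
    T, D = len(target), len(target[0])
    losses = []
    for _ in range(steps):
        g1, g2, loss = [0.0] * latent_dim, [0.0] * latent_dim, 0.0
        for (z_t, a), x in zip(inflate(z1, z2, T), target):
            cond = label_onehot + [a]  # assumed coherence variable: normalized time
            res = [p - q for p, q in zip(decode(W, z_t, cond), x)]
            loss += sum(r * r for r in res) / (T * D)
            for k in range(latent_dim):
                # d(MSE)/d(z_t[k]); the first latent_dim columns of W act on z_t
                g = 2.0 / (T * D) * sum(W[d][k] * res[d] for d in range(D))
                g1[k] += (1 - a) * g  # chain rule through the interpolation
                g2[k] += a * g
        losses.append(loss)
        z1 = [p - lr * g for p, g in zip(z1, g1)]
        z2 = [p - lr * g for p, g in zip(z2, g2)]
    return z1, z2, losses


# Toy decoder weights: 3-d features from [2-d latent ; 2-class one-hot ; coherence].
W = [
    [1.0, 0.0, 0.5, -0.5, 0.2],
    [0.0, 1.0, -0.3, 0.3, 0.1],
    [0.5, -0.5, 0.1, 0.2, 0.4],
]
label = [1.0, 0.0]
true_z1, true_z2 = [0.8, -0.4], [-0.6, 1.2]
# Build an 8-frame "original" segment that the decoder can represent exactly.
segment = [decode(W, z_t, label + [a]) for z_t, a in inflate(true_z1, true_z2, 8)]

z1_star, z2_star, losses = invert(W, segment, label)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.2e}")
```

Only `z1_star` and `z2_star` (plus the label and segment length) would need to be kept for this segment. Because the toy segment was generated by the same decoder, the inversion recovers the generating codes; for real features the reconstruction would generally be approximate.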