SFGANS Self-supervised Future Generator for human ActioN Segmentation

Or Berman, Adam Goldbraikh, Shlomi Laufer

TL;DR

The paper tackles long untrimmed video action segmentation by inserting a self-supervised future-feature generator (SFGANS) midway in the standard pipeline to refine feature representations before segmentation. The generator uses a retrospective cycle-GAN framework to predict short-horizon future feature vectors from past features, trained with a cycle-consistent adversarial objective and a sequence-prediction loss, and it outputs refined features for downstream models. Across temporal, online, and timestamp-supervised segmentation tasks and multiple datasets, employing the predicted future features yields consistent improvements over baselines without hyperparameter tuning, with additional gains achievable through hyperparameter optimization or all-data self-supervised training. The approach demonstrates practical benefits for diverse backbones (e.g., MS-TCN++, ASFormer, DTGRM) and datasets, suggesting a robust, generalizable boost to action-segmentation performance with minimal labeling overhead and modest computational overhead for real-time settings.

Abstract

The ability to locate and classify action segments in long untrimmed video is of particular interest to many applications, such as autonomous cars, robotics, and healthcare. Today, the most popular pipeline for action segmentation consists of encoding the frames into feature vectors, which are then processed by a temporal model for segmentation. In this paper we present a self-supervised method that sits in the middle of the standard pipeline and generates refined representations of the original feature vectors. Experiments show that this method improves the performance of existing models on different sub-tasks of action segmentation, even without additional hyperparameter tuning.

Paper Structure (25 sections, 8 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Full paper pipeline. (1) Frames are encoded into feature vectors using a feature extractor. (2) A prediction of the (n+i)-th vector is generated from a sequence of feature vectors. In this paper, this was implemented for i = 1, 4, and 10. (3) The n-th feature vector is replaced with the prediction of the (n+i)-th vector. Phases 2 and 3 are repeated for each vector. (4) The predicted features replace the original ones and are sent to the segmentation model.
  • Figure 2: The generator architecture. It is strongly based on the original generator architecture from kwon2019predicting, with additions marked in red. In the figure, I-BN is instance batch norm, and k, n, and s denote the kernel size, channels, and stride, respectively. A more detailed description of the architecture, including the architectures of the residual blocks and the discriminators, can be found in kwon2019predicting. The notation follows that work for convenience.
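
Phases (2) to (4) of Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the trained generator is replaced by a hypothetical stand-in (`linear_extrapolation`), and the names `refine_features`, `window_size`, and `horizon` (the i of the caption), as well as the handling of the first frames, are assumptions made only for illustration.

```python
import numpy as np


def linear_extrapolation(window, horizon):
    """Hypothetical stand-in for the trained generator.

    Extrapolates the last observed step `horizon` frames ahead.
    """
    if len(window) < 2:
        # Not enough history to estimate a step: return the last vector.
        return window[-1].copy()
    step = window[-1] - window[-2]
    return window[-1] + horizon * step


def refine_features(features, predict_future, window_size=4, horizon=1):
    """Replace each feature vector n with a prediction of vector n + horizon."""
    features = np.asarray(features, dtype=float)
    refined = np.empty_like(features)
    for n in range(len(features)):
        # Assumption: the first frames simply use a shorter window of past features.
        start = max(0, n - window_size + 1)
        refined[n] = predict_future(features[start:n + 1], horizon)
    return refined


# Toy input: 6 frames with 2-dimensional features that grow linearly in time.
features = np.arange(6, dtype=float)[:, None] * np.array([1.0, 2.0])
refined = refine_features(features, linear_extrapolation, horizon=4)
```

The refined array has the same shape as the input, so it can be passed to a segmentation model in place of the original features, as in phase (4).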