SFGANS Self-supervised Future Generator for human ActioN Segmentation
Or Berman, Adam Goldbraikh, Shlomi Laufer
TL;DR
The paper tackles long untrimmed video action segmentation by inserting a self-supervised future-feature generator (SFGANS) midway in the standard pipeline to refine feature representations before segmentation. The generator uses a retrospective cycle-GAN framework to predict short-horizon future feature vectors from past features, trained with a cycle-consistent adversarial objective and a sequence-prediction loss, and it outputs refined features for downstream models. Across temporal, online, and timestamp-supervised segmentation tasks and multiple datasets, employing the predicted future features yields consistent improvements over baselines without hyperparameter tuning, with additional gains achievable through hyperparameter optimization or all-data self-supervised training. The approach demonstrates practical benefits for diverse backbones (e.g., MS-TCN++, ASFormer, DTGRM) and datasets, suggesting a robust, generalizable boost to action-segmentation performance with minimal labeling overhead and modest computational overhead for real-time settings.
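As a rough illustration of why the future-prediction task needs no labels, the sketch below slices a per-frame feature sequence into (past, future) windows and scores a predictor against the true future features. Everything here is an assumption made for illustration: the function names, the window lengths, the toy features, and above all `placeholder_generator`, which is a plain linear extrapolation standing in for the learned generator. It is not the retrospective cycle-GAN described above, and the mean squared error shown is only one possible form of a sequence-prediction loss.

```python
# Hypothetical sketch: names, window sizes, and the extrapolation rule are
# illustrative assumptions, not the method from the paper.

def make_past_future_pairs(features, past_len, future_len):
    """Slice a per-frame feature sequence into (past, future) windows.

    The targets come from the sequence itself, so no labels are needed.
    """
    pairs = []
    for t in range(past_len, len(features) - future_len + 1):
        pairs.append((features[t - past_len:t], features[t:t + future_len]))
    return pairs


def placeholder_generator(past, future_len):
    """Stand-in for a learned generator: linear extrapolation from the
    last two past feature vectors."""
    last, prev = past[-1], past[-2]
    step = [a - b for a, b in zip(last, prev)]
    return [[x + k * s for x, s in zip(last, step)]
            for k in range(1, future_len + 1)]


def mse(pred, target):
    """Mean squared error between predicted and true future features."""
    diffs = [(p - q) ** 2
             for p_vec, q_vec in zip(pred, target)
             for p, q in zip(p_vec, q_vec)]
    return sum(diffs) / len(diffs)


# Toy features: 10 frames, 2 dimensions, changing linearly over time.
features = [[float(t), 2.0 * t] for t in range(10)]
pairs = make_past_future_pairs(features, past_len=3, future_len=2)
losses = [mse(placeholder_generator(past, 2), future) for past, future in pairs]
mean_loss = sum(losses) / len(losses)
```

On this toy sequence the features change linearly, so the extrapolation matches the true future exactly. Real video features would not behave this way, which is where a learned generator comes in.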
Abstract
The ability to locate and classify action segments in long untrimmed video is of particular interest to many applications, such as autonomous cars, robotics, and healthcare. Today, the most popular pipeline for action segmentation encodes the frames into feature vectors, which are then processed by a temporal model for segmentation. In this paper we present a self-supervised method that sits in the middle of this standard pipeline and generates refined representations of the original feature vectors. Experiments show that this method improves the performance of existing models on different sub-tasks of action segmentation, even without additional hyperparameter tuning.
