Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment
Zhanzhong Pang, Fadime Sener, Shrinivas Ramasubramanian, Angela Yao
TL;DR
This work addresses the long-tail problem in temporal action segmentation of procedural videos by introducing Group-wise Temporal Logit Adjustment (G-TLA). G-TLA combines activity-conditioned group-wise classification with a temporally aware logit adjustment that leverages action-order priors, reducing tail-class false positives while preserving head-class performance. The method introduces a two-stage G-TLA loss and an inference procedure that selects the appropriate action group. It improves frame-level and segment-level metrics across multiple datasets and backbones, and ablations confirm the contributions of both group-wise classification and temporal priors. The results show stronger tail-action recognition and better-balanced performance, suggesting practical value for robust understanding of complex procedural videos in real-world settings.
Abstract
Procedural activity videos often exhibit a long-tailed action distribution due to varying action frequencies and durations. However, state-of-the-art temporal action segmentation methods overlook the long tail and fail to recognize tail actions. Existing long-tail methods make class-independent assumptions and struggle to identify tail classes when applied to temporal segmentation frameworks. This work proposes a novel group-wise temporal logit adjustment (G-TLA) framework that combines a group-wise softmax formulation with a logit adjustment that leverages activity information and action ordering. The proposed framework significantly improves the segmentation of tail actions without any performance loss on head actions.
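
To make the idea of a group-wise softmax with logit adjustment concrete, the following is a minimal sketch for a single frame, not the authors' implementation. It assumes the generic post-hoc form of logit adjustment (subtracting a scaled log prior from each logit) and restricts the softmax to the actions of one activity group. The function name `group_softmax_with_prior`, the parameter `tau`, and the example priors are all hypothetical; how G-TLA actually derives its priors from action ordering is not specified here.

```python
import math

def group_softmax_with_prior(logits, group, prior, tau=1.0):
    """Softmax over the classes in `group` only, after subtracting
    tau * log(prior[c]) from each logit (generic logit adjustment).

    logits: list of raw per-class scores for one frame
    group:  class indices belonging to the selected activity (assumed given)
    prior:  dict mapping class index -> prior probability (assumed given;
            a temporal variant could condition it on the preceding action)
    """
    adjusted = {c: logits[c] - tau * math.log(prior[c]) for c in group}
    peak = max(adjusted.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - peak) for c, v in adjusted.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Illustrative values only: class 0 is a head action, class 2 a tail action,
# and class 3 belongs to a different activity, so it is excluded.
logits = [2.0, 1.0, 0.5, 3.0]
group = [0, 1, 2]
prior = {0: 0.7, 1: 0.2, 2: 0.1}

plain = group_softmax_with_prior(logits, group, prior, tau=0.0)
adjusted = group_softmax_with_prior(logits, group, prior, tau=1.0)
print(max(plain, key=plain.get), max(adjusted, key=adjusted.get))
```

With `tau=0.0` the function reduces to an ordinary softmax over the group, so comparing the two calls shows how the prior term shifts probability mass toward rarer classes within the same activity.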
