Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment

Zhanzhong Pang, Fadime Sener, Shrinivas Ramasubramanian, Angela Yao

TL;DR

This work addresses the long-tail problem in temporal action segmentation of procedural videos by introducing Group-wise Temporal Logit Adjustment (G-TLA). G-TLA combines activity-conditioned group-wise classification with a temporally aware logit adjustment that leverages action order priors, reducing tail-class false positives while preserving head-class performance. The method introduces a two-stage G-TLA loss and an inference procedure that selects the appropriate action group, yielding improved frame-level and segment-level metrics across multiple datasets and backbones, with ablations confirming the contributions of group-wise classification and temporal priors. The results demonstrate stronger tail-action recognition and better-balanced performance, suggesting practical impact for robust understanding of complex procedural videos in real-world settings.

Abstract

Procedural activity videos often exhibit a long-tailed action distribution due to varying action frequencies and durations. However, state-of-the-art temporal action segmentation methods overlook the long tail and fail to recognize tail actions. Existing long-tail methods make class-independent assumptions and struggle to identify tail classes when applied to temporal segmentation frameworks. This work proposes a novel group-wise temporal logit adjustment (G-TLA) framework that combines a group-wise softmax formulation with logit adjustment informed by activity information and action ordering. The proposed framework significantly improves the segmentation of tail actions without any performance loss on head actions.
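To make the two ingredients named in the abstract concrete, the following is a minimal sketch of generic post-hoc logit adjustment followed by a softmax computed separately within each group of classes. It is not the paper's G-TLA loss: the function names, the toy class counts, the grouping, and the temperature `tau` are all hypothetical and chosen only for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def logit_adjust(logits, class_counts, tau=1.0):
    """Generic post-hoc logit adjustment: subtract tau * log(prior) per class.

    Rare (tail) classes have a small prior, so they receive a larger boost.
    """
    total = sum(class_counts)
    return [z - tau * math.log(n / total) for z, n in zip(logits, class_counts)]

def group_softmax(logits, groups):
    """Apply softmax independently inside each group of class indices."""
    probs = [0.0] * len(logits)
    for members in groups:
        group_probs = softmax([logits[i] for i in members])
        for i, p in zip(members, group_probs):
            probs[i] = p
    return probs

# Hypothetical toy setup: 4 action classes split into 2 activity groups.
logits = [2.0, 1.5, 0.5, 1.0]
counts = [900, 50, 40, 10]      # assumed long-tailed frame counts
groups = [[0, 1], [2, 3]]       # assumed activity-specific action groups

adjusted = logit_adjust(logits, counts)
probs = group_softmax(adjusted, groups)
```

In this sketch each group's probabilities sum to one on their own, so classes belonging to a different activity never compete with the classes of the selected group.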

Paper Structure (26 sections, 15 equations, 12 figures, 27 tables, 1 algorithm)

Figures (12)

  • Figure 1: "Making tea", with temporal segments indicated by colored bars. The tail action 'stir tea' is recognized by logit adjustment (LA) and our G-TLA but not by the MSTCN backbone. However, LA overlooks the action order and activity, resulting in activity-irrelevant false positives such as 'take bowl' and 'stir coffee', and temporally illogical false positives such as 'add teabag' occurring after 'stir tea'.
  • Figure 2: Temporal action segmentation datasets exhibit a long-tail distribution of actions due to varying frequencies of actions and action durations.
  • Figure 3: Our group-wise temporal logit adjustment framework consists of group-wise classification and temporal logit adjustment within the respective group. The temporal logit adjustment is only applied to the target group ($G_1$ in this illustration).
  • Figure 4: Illustration of temporal logit adjustment for class $c=$'add teabag'. The adjustment only occurs within the temporal bounds.
  • Figure 5: Radar charts of different logit adjustment methods, measuring the performance along balanced and global metrics on Breakfast with MSTCN and AsFormer.
  • ...and 7 more figures
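
The idea in the Figure 4 caption, that a class's adjustment only occurs within its temporal bounds, can be sketched as follows. This is an illustrative guess at the mechanism, not the paper's formulation: the normalized-time bounds, the prior values, the post-hoc subtraction of `tau * log(prior)`, and all names are assumptions.

```python
import math

def temporal_logit_adjust(frame_logits, priors, bounds, tau=1.0):
    """Adjust class logits only at frames inside each class's temporal bounds.

    frame_logits: list of T rows, each with one logit per class.
    priors:       assumed class priors (one per class, each in (0, 1]).
    bounds:       assumed (start, end) per class, in normalized video time [0, 1].
    """
    num_frames = len(frame_logits)
    adjusted = []
    for t, row in enumerate(frame_logits):
        # Normalized position of frame t within the video.
        pos = t / max(num_frames - 1, 1)
        new_row = []
        for c, z in enumerate(row):
            start, end = bounds[c]
            if start <= pos <= end:
                # Inside the bounds: boost rare classes via the log-prior term.
                z = z - tau * math.log(priors[c])
            new_row.append(z)
        adjusted.append(new_row)
    return adjusted

# Hypothetical toy video: 5 frames, 2 classes (class 1 is the tail class).
frame_logits = [[1.0, 1.0] for _ in range(5)]
priors = [0.9, 0.1]
bounds = [(0.0, 1.0), (0.0, 0.5)]  # class 1 assumed to occur only in the first half

adjusted = temporal_logit_adjust(frame_logits, priors, bounds)
```

With these toy values the tail class is boosted only in the first three frames; in the last two frames its logit is left unchanged, which is how an ordering prior could suppress temporally illogical detections.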