Cost-Sensitive Learning for Long-Tailed Temporal Action Segmentation

Zhanzhong Pang, Fadime Sener, Shrinivas Ramasubramanian, Angela Yao

TL;DR

The paper tackles long-tail challenges in temporal action segmentation by identifying a bi-level learning bias: class-level bias from imbalance and transition-level bias from uneven transition frequencies. It introduces learning-state-aware constrained optimization and a cost-sensitive loss that adaptively weights frames based on action and transition learning states, cast as a Lagrangian min–max problem. A transition confusion tensor defines per-class and per-transition learning states, guiding constraints and reweighting, while Segment Nearest Class Mean (S-NCM) helps stabilize inference. Across Breakfast, 50Salads, and Assembly101 with MS-TCN, ASFormer, and DiffAct backbones, the approach yields strong per-class improvements, especially for tail classes, without sacrificing global metrics, and demonstrates enhanced transition-detection capability and robust tail balancing. These contributions offer a practical route to balanced temporal segmentation in long-tailed real-world videos, with manageable computational overhead during training.
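
To make the idea of a transition confusion tensor more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a tensor indexed as `[true class, predicted class, previous action]` with shape `(L, L, L + 1)`, where the extra index `L` stands for "no previous action"; that layout, the function names, and the use of per-pair accuracy as a "learning state" are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def previous_actions(labels, num_classes):
    """For each frame, return the label of the preceding segment.

    Frames in the first segment get `num_classes`, an assumed placeholder
    index meaning "no previous action".
    """
    labels = np.asarray(labels)
    prev = np.empty_like(labels)
    current_prev = num_classes
    for t in range(len(labels)):
        # A label change marks a segment boundary: the old label becomes
        # the previous action for every frame of the new segment.
        if t > 0 and labels[t] != labels[t - 1]:
            current_prev = labels[t - 1]
        prev[t] = current_prev
    return prev


def transition_confusion_tensor(labels, preds, num_classes):
    """Count frames by (true class, predicted class, previous action)."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    prev = previous_actions(labels, num_classes)
    tensor = np.zeros((num_classes, num_classes, num_classes + 1), dtype=np.int64)
    # np.add.at accumulates correctly when the same index triple repeats.
    np.add.at(tensor, (labels, preds, prev), 1)
    return tensor


def transition_accuracy(tensor):
    """Per (class, previous action) accuracy; NaN where the pair never occurs."""
    correct = np.einsum("iiu->iu", tensor)  # entries with true == predicted
    total = tensor.sum(axis=1)              # all frames of that (class, prev) pair
    return np.where(total > 0, correct / np.maximum(total, 1), np.nan)
```

Averaging such a table over the previous-action axis would give a per-class view, while the individual entries give a per-transition view; how the paper turns these quantities into constraints is not reproduced here.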

Abstract

Temporal action segmentation in untrimmed procedural videos aims to densely label frames into action classes. These videos inherently exhibit long-tailed distributions, where actions vary widely in frequency and duration. In temporal action segmentation approaches, we identified a bi-level learning bias. This bias encompasses (1) a class-level bias, stemming from class imbalance favoring head classes, and (2) a transition-level bias arising from variations in transitions, prioritizing commonly observed transitions. As a remedy, we introduce a constrained optimization problem to alleviate both biases. We define learning states for action classes and their associated transitions and integrate them into the optimization process. We propose a novel cost-sensitive loss function formulated as a weighted cross-entropy loss, with weights adaptively adjusted based on the learning state of actions and their transitions. Experiments on three challenging temporal segmentation benchmarks and various frameworks demonstrate the effectiveness of our approach, resulting in significant improvements in both per-class frame-wise and segment-wise performance.
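
As a rough illustration of a weighted cross-entropy loss whose weights depend on both the action class and its preceding action, consider the NumPy sketch below. It is a sketch under stated assumptions: the weight table of shape `(L, L + 1)`, the "no previous action" index `L`, and in particular the `weights_from_accuracy` rule (weight = 1 - accuracy) are hypothetical stand-ins. The paper derives its weights from a constrained, Lagrangian formulation, which this simple heuristic does not reproduce.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cost_sensitive_ce(logits, labels, prev, weights):
    """Frame-wise weighted cross-entropy.

    logits:  (T, L) per-frame class scores
    labels:  (T,)   ground-truth class per frame
    prev:    (T,)   previous action per frame, in [0, L]; L = none (assumed)
    weights: (L, L + 1) weight for each (class, previous action) pair
    """
    labels = np.asarray(labels)
    prev = np.asarray(prev)
    logp = log_softmax(np.asarray(logits, dtype=float))
    nll = -logp[np.arange(len(labels)), labels]  # per-frame negative log-likelihood
    w = weights[labels, prev]                    # look up each frame's weight
    return float((w * nll).sum() / w.sum())      # weight-normalised average


def weights_from_accuracy(acc, floor=0.1):
    """Hypothetical heuristic: poorly learned pairs get larger weights.

    Pairs never observed (NaN accuracy) are treated as accuracy 0.
    """
    w = 1.0 - np.nan_to_num(acc, nan=0.0)
    return np.clip(w, floor, None)
```

With uniform weights the function reduces to the ordinary mean cross-entropy, so the reweighting only changes which frames dominate the average.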

Paper Structure

This paper contains 11 sections, 1 theorem, 17 equations, 7 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

Given a timestamp $t$ and its previous action $u_t$, the optimal classifier for a gain matrix $G \in \mathbb{R}^{L \times L \times (L+1)}$ at the $t^{\text{th}}$ frame takes the form: where $p_i(X)$ is the estimated conditional probability for class $i$ at the current frame $t$ by classifier $f$.

Figures (7)

  • Figure 1: (a) Long-tailed action distribution on Breakfast (Kuehne et al., 2014). The long-tailed distribution results in low accuracy on tail actions with ASFormer (Yi et al., 2021). (b) Left: head and tail loss curves show a slow convergence rate on tail actions, demonstrating the class-level learning bias. Right: the tail action 'take_eggs' shows a skewed transition distribution (pie chart), i.e., different transitions from {'SIL', 'pour_oil', 'take_bowl'} to 'take_eggs', and a transition-level learning bias (loss curves, where the common transition from 'pour_oil' is better learned than the one from 'take_bowl').
  • Figure 2: The t-SNE of the frame-wise representations for a video of making cereal exhibits a strong temporal continuity.
  • Figure 3: Transition-based confusion tensor.
  • Figure 4: Transition accuracy and Lagrangian multiplier, $\lambda$, curves during training using ASFormer on Breakfast.
  • Figure 5: Class-wise accuracy distribution.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof