Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation

Hinako Mitsuoka, Kazuhiro Hotta

Abstract

Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss, which promotes accurate temporal localization via a single-channel boundary prediction, with a CDF-based segment-level regularization loss, which encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets and three backbone models, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
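To make the boundary-supervision idea concrete, the following is a minimal sketch of a boundary-regression loss on the extra output channel. The paper's exact formulation is not reproduced here; the Gaussian-smoothed boundary target and the MSE objective are illustrative assumptions (soft boundary targets are a common choice in boundary-aware TAS work), and the function name is hypothetical.

```python
import numpy as np

def boundary_regression_loss(pred_boundary, gt_labels, sigma=2.0):
    """Illustrative boundary-regression loss (hypothetical formulation).

    pred_boundary: (T,) predicted boundary probabilities from the extra channel.
    gt_labels:     (T,) integer ground-truth class label per frame.

    The regression target is a soft boundary mask: a Gaussian bump (width
    `sigma`, in frames) centered at every frame where the class label changes.
    The loss is the frame-wise MSE between the predicted boundary curve and
    this target, encouraging high responses near transitions and low values
    elsewhere.
    """
    T = len(gt_labels)
    # Frames where the ground-truth class changes (segment boundaries).
    boundaries = np.flatnonzero(np.diff(gt_labels)) + 1
    t = np.arange(T)
    target = np.zeros(T)
    for b in boundaries:
        # Take the pointwise max so overlapping bumps do not exceed 1.
        target = np.maximum(target, np.exp(-((t - b) ** 2) / (2 * sigma**2)))
    return float(np.mean((pred_boundary - target) ** 2))
```

In a real training loop this term would be weighted and added to the backbone's frame-classification loss, leaving the rest of the model unchanged.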
Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of the proposed method and its performance improvement. The left shows that the proposed method adds a single boundary output channel and introduces two auxiliary training losses that can be integrated into conventional TAS models with minimal modification. The right shows the relationship between the F1 score and the number of parameters on GTEA, illustrating improved performance with minimal parameter overhead.
  • Figure 2: Overview of the proposed training framework. In addition to the model-specific loss, our method introduces two complementary auxiliary losses: a boundary-regression loss using an additional boundary output channel and a CDF-based segment shape regularization loss. The losses are applied selectively to different temporal regions (boundary vs. non-boundary) to reduce optimization conflicts, and can be combined with existing TAS models with minimal architectural modification.
  • Figure 3: Visualization of Boundary-Regression Loss $\mathcal{L}_B$. The additional output channel predicts a boundary probability curve (orange), which is supervised by a binary ground-truth boundary mask (top). The loss encourages high responses around class transitions while maintaining low values elsewhere. Segment labels (bottom) are color-coded for clarity.
  • Figure 4: Visualization of the CDF-based segment shape regularization loss $\mathcal{L}_{\text{S}}$. Each ground-truth segment (top) is compared with its corresponding predicted region (bottom) using cumulative distributions. The loss penalizes structural mismatches such as over-segmentation or fragmented predictions by measuring discrepancies between cumulative probability distributions within each segment.
  • Figure 5: Qualitative comparison on GTEA, 50Salads, and Breakfast using MS-TCN as the backbone. For each dataset, we show the ground-truth (GT), baseline MS-TCN predictions, and MS-TCN trained with the proposed losses.
  • ...and 2 more figures
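The segment-level comparison described in the Figure 4 caption can be sketched as follows. This is a hypothetical instantiation, not the paper's exact loss: within each ground-truth segment, the predicted probability mass for the segment's class is normalized into a distribution over the segment's frames, and its CDF is compared (in L1) against the CDF of the uniform distribution, which is what a uniformly confident, coherent prediction would induce. Fragmented or over-segmented predictions concentrate mass unevenly and are penalized.

```python
import numpy as np

def segment_cdf_loss(pred_probs, gt_labels):
    """Illustrative CDF-based segment regularization (hypothetical formulation).

    pred_probs: (T, C) per-frame class probabilities.
    gt_labels:  (T,) integer ground-truth class label per frame.
    """
    T = len(gt_labels)
    # Segment start indices: frame 0 plus every frame where the class changes.
    starts = np.concatenate(([0], np.flatnonzero(np.diff(gt_labels)) + 1))
    ends = np.concatenate((starts[1:], [T]))
    total = 0.0
    for s, e in zip(starts, ends):
        c = gt_labels[s]                      # the segment's ground-truth class
        p = pred_probs[s:e, c]
        p = p / (p.sum() + 1e-8)              # distribution over segment frames
        cdf_pred = np.cumsum(p)
        cdf_unif = np.arange(1, e - s + 1) / (e - s)  # uniform-distribution CDF
        total += np.mean(np.abs(cdf_pred - cdf_unif))
    return float(total / len(starts))
```

A prediction whose confidence for the correct class is flat across a segment yields a near-zero penalty, while one whose confidence collapses onto a few frames (a typical over-segmentation symptom) does not, which matches the structural mismatches the caption describes.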