Improving action segmentation via explicit similarity measurement

Kamel Aouaidjia, Wenhao Zhang, Aofan Li, Chongsheng Zhang

TL;DR

This work tackles action segmentation with ASESM, which adds explicit similarity measurement across frames and predictions to improve boundary localization and segmentation accuracy beyond what frame-wise prediction alone achieves. The method feeds multi-resolution frame features to four parallel transformers, combines their predictions through similarity voting to form an initial label sequence, iteratively corrects boundaries based on feature similarity, and then refines the result with temporal convolutions. The paper also proposes a fully unsupervised boundary detection-correction algorithm that relies solely on feature similarity, providing a training-free alternative. Results on 50Salads, GTEA, and Breakfast show superior or competitive performance in both the supervised and unsupervised settings, and ablations confirm the contribution of similarity voting, boundary correction, and smoothing. Code is available on GitHub.

Abstract

Existing supervised action segmentation methods depend on the quality of frame-wise classification, using attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods rely primarily on the accuracy of an initial frame-wise classification, so they can miss the precise locations of segments and boundaries when that prediction is of low quality. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement), which improves segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture uses frame-level multi-resolution features as input to multiple Transformer encoders. The resulting frame-wise predictions are combined through similarity voting to obtain a high-quality initial prediction. We then apply a newly proposed boundary correction algorithm that uses feature similarity between consecutive frames to adjust the boundary locations iteratively during training. The corrected prediction is further refined through multiple stages of temporal convolutions. As post-processing, we optionally apply boundary correction again, followed by a segment smoothing method that removes outlier classes within segments using similarity measurement between consecutive predictions. Additionally, we propose a fully unsupervised boundary detection-correction algorithm that identifies segment boundaries based solely on feature similarity, without any training. Experiments on the 50Salads, GTEA, and Breakfast datasets show the effectiveness of both the supervised and unsupervised algorithms. Code and models are available on GitHub.
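To make the idea of finding boundaries from feature similarity alone more concrete, here is a minimal sketch. It is not the paper's detection-correction algorithm, whose details are not given in this summary. It assumes cosine similarity as the measure and a fixed threshold, and the names `cosine_similarity`, `detect_boundaries`, and `threshold` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def detect_boundaries(features, threshold=0.5):
    """Return every index t at which frame t is assumed to start a new segment.

    Assumption (not from the paper): a boundary is declared wherever the
    similarity between consecutive frames falls below a fixed threshold.
    """
    boundaries = []
    for t in range(1, len(features)):
        if cosine_similarity(features[t - 1], features[t]) < threshold:
            boundaries.append(t)
    return boundaries


# Toy per-frame features: three frames of one action, then two of another.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
boundaries = detect_boundaries(features)
print(boundaries)
```

On real video features a single global threshold is likely too crude, which is presumably why the paper pairs detection with a correction step.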

Paper Structure

This paper contains 22 sections, 9 equations, 5 figures, 9 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Framework of our supervised learning architecture for explicit similarity measurement. It involves three levels of similarity: multi-resolution prediction similarity voting, boundary correction based on frame-wise feature similarity, and segment smoothing based on frame-wise prediction similarity.
  • Figure 2: Encoder structure (left) and Temporal Convolution Block (TCB) (right), which are modified versions of the ASFormer (Yi et al., 2021) encoder and a single stage of MS-TCN (Farha and Gall, 2019), respectively. The added components are shown in green.
  • Figure 3: Visual illustration of the boundary correction algorithm.
  • Figure 4: Segment smoothing: Two shifting windows are used. The smoothing happens within the first window only. The second window is used to check the location of the boundary.
  • Figure 5: Qualitative results of supervised action segmentation on test videos from split 1 of the 50Salads dataset (left) and split 2 of the Breakfast dataset (right). GT: ground truth; Pr: initial prediction; BC+Smth: prediction after applying boundary correction and smoothing. Corrected predictions are marked with green rectangles.
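The first similarity level named in Figure 1, voting across the multi-resolution predictions, can be pictured with a simple per-frame majority vote. This is a simplified stand-in for illustration only: the paper's similarity voting rule is not described in this summary, and the function name `majority_vote` and the toy labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine several frame-wise label sequences into one sequence.

    predictions: list of equal-length label sequences, one per encoder.
    Assumption (not from the paper): each frame takes its most frequent label.
    Ties go to the label seen first, i.e. to the earlier prediction stream.
    """
    num_frames = len(predictions[0])
    voted = []
    for t in range(num_frames):
        labels_at_t = [sequence[t] for sequence in predictions]
        voted.append(Counter(labels_at_t).most_common(1)[0][0])
    return voted


# Four toy prediction streams, standing in for the four parallel encoders.
predictions = [
    ["cut", "cut", "mix", "mix"],
    ["cut", "cut", "cut", "mix"],
    ["cut", "mix", "mix", "mix"],
    ["cut", "cut", "mix", "mix"],
]
initial_prediction = majority_vote(predictions)
print(initial_prediction)
```

Isolated disagreements in a single stream, such as the late "cut" in the second stream, are outvoted by the other streams.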