Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation

Ming Xu, Stephen Gould

TL;DR

This work tackles unsupervised action segmentation in long, untrimmed videos with ASOT, a post-processing step that decodes temporally coherent segmentations from noisy frame-to-action affinities. ASOT solves a fused, unbalanced Gromov-Wasserstein (GW) optimal transport (OT) problem that combines a ground cost for visual similarity with a structure-aware GW prior; the prior enforces temporal consistency while allowing actions to appear in varying orders and with unequal frequencies. The approach yields state-of-the-art results on Breakfast, 50-Salads, YouTube Instructions, and Desktop Assembly in the unsupervised setting, and it also improves supervised methods when applied as a post-processing step. A projected mirror descent solver makes the method efficient on GPUs, and the decoded segmentations can serve as pseudo-labels for self-training, which lets the pipeline scale to large video collections.

Abstract

We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
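
The decoding idea in the abstract can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `decode_segmentation`, the two structure matrices, and all default parameter values are hypothetical choices made for this example. It also drops the unbalanced penalty on action marginals and simply leaves them unconstrained, keeping only the constraint that every frame carries the same total mass.

```python
import numpy as np


def decode_segmentation(cost, radius=3, alpha=0.5, step=2.0, n_iters=20):
    """Decode per-frame labels from a (frames x actions) cost matrix.

    Simplified sketch: each row of T is a distribution over actions and
    the action (column) marginals are left unconstrained.
    """
    n_frames, n_actions = cost.shape

    # Assumed temporal structure: frames within `radius` steps are "close".
    idx = np.arange(n_frames)
    gap = np.abs(idx[:, None] - idx[None, :])
    c_frames = ((gap >= 1) & (gap <= radius)).astype(float) / radius

    # Assumed action structure: any two distinct actions are "far apart".
    c_actions = 1.0 - np.eye(n_actions)

    # Start from a uniform soft assignment.
    T = np.full((n_frames, n_actions), 1.0 / n_actions)

    for _ in range(n_iters):
        # Gradient of sum_{ijkl} c_frames[i,j] c_actions[k,l] T[i,k] T[j,l];
        # the factor 2 is valid because both structure matrices are symmetric.
        grad_gw = 2.0 * c_frames @ T @ c_actions
        grad = alpha * grad_gw + (1.0 - alpha) * cost
        # Mirror descent step in the KL geometry (multiplicative update).
        T = T * np.exp(-step * grad)
        # KL projection onto the row constraint: renormalize every row.
        T = T / T.sum(axis=1, keepdims=True)

    return T.argmax(axis=1)


# Toy example: three actions of ten frames each, with two corrupted frames.
truth = [0] * 10 + [1] * 10 + [2] * 10
noisy = list(truth)
noisy[5], noisy[15] = 2, 0
cost = np.ones((30, 3))
cost[np.arange(30), noisy] = 0.0
labels = decode_segmentation(cost)
```

In this toy case the per-frame minimum of `cost` mislabels frames 5 and 15, while the structure term pulls each of them back to the action of its temporal neighbours. Nothing in the sketch fixes an action order or forces actions to receive equal numbers of frames.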

Paper Structure (45 sections, 8 equations, 8 figures, 5 tables, 1 algorithm)


Figures (8)

  • Figure 1: High-level overview of our action segmentation method, ASOT. Given a (noisy) cost/affinity matrix between video frames and actions, ASOT solves an optimal transport (OT) problem to yield temporally consistent segmentations.
  • Figure 2: A raw frame/action affinity matrix in a) is decoded using ASOT into a temporally consistent segmentation in b). Removing GW from b) destroys temporal consistency, shown in c). Forcing a balanced assignment to actions in b) yields temporally consistent, but unintuitive results, shown in d).
  • Figure 3: The unsupervised training pipeline. Orange are learnable parameters, and arrows indicate computation/gradient flow.
  • Figure 4: Example action segmentations for videos in the same activity category with differing action orderings. Different colors correspond to different actions. Videos of complex, multi-stage activities can exhibit markedly different action orderings.
  • Figure 5: Sensitivity analysis on FUGW OT hyperparameters.
  • ...and 3 more figures