Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation

Ming Xu; Stephen Gould

TL;DR

This work tackles unsupervised action segmentation in long, untrimmed videos with ASOT (Action Segmentation Optimal Transport), a decoding step that turns noisy frame-to-action affinities into temporally coherent segmentations. ASOT solves a fused, unbalanced Gromov-Wasserstein (GW) optimal transport problem: a ground cost captures visual similarity between frames and actions, while a structure-aware GW term enforces temporal consistency and still allows actions to appear in varying orders and with unequal frequencies. The approach yields state-of-the-art results on Breakfast, 50 Salads, YouTube Instructions, and Desktop Assembly in the unsupervised setting, and also improves supervised methods when used as a post-processing step. The problem is solved efficiently on GPUs with a projected mirror descent solver, and the decoded segmentations serve as pseudo-labels for self-training.

Abstract

We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
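
To make the decoding step in the abstract concrete, the following is a minimal NumPy sketch of projected mirror descent on a fused Gromov-Wasserstein objective with a relaxed action marginal. It is not the authors' implementation. The function names, the form and scaling of the structure matrices, the uniform marginals, the KL weight `lam`, the step size, and the iteration count are all assumptions chosen for illustration; the paper's exact cost definitions and hyperparameters may differ.

```python
import numpy as np


def temporal_structure(n_frames, radius):
    """Frame-frame structure cost (assumed form): nonzero for distinct frames
    at most `radius` steps apart. The n/(2*radius) scale is an illustrative
    choice that keeps the GW gradient on the same order as the ground cost."""
    idx = np.arange(n_frames)
    dist = np.abs(idx[:, None] - idx[None, :])
    window = (dist >= 1) & (dist <= radius)
    return window * (n_frames / (2.0 * radius))


def decode_segmentation(cost, alpha=0.5, radius=5, lam=0.1, step=1.0, n_iters=30):
    """Decode a frame-to-action plan T from a noisy (n_frames, n_actions) cost.

    Objective (sketch): alpha * GW(T) + (1 - alpha) * <cost, T>
                        + lam * KL(T^T 1 || q),
    with the frame marginal fixed to uniform and the action marginal only
    softly tied to uniform q (the "unbalanced" part).
    """
    n, k = cost.shape
    log_p = np.full(n, -np.log(n))   # fixed uniform frame marginal (log domain)
    q = np.full(k, 1.0 / k)          # soft target for the action marginal
    Cv = temporal_structure(n, radius)
    Ca = 1.0 - np.eye(k)             # 1 for different actions, 0 for the same
    log_T = np.full((n, k), -np.log(n * k))  # start from the product coupling

    for _ in range(n_iters):
        T = np.exp(log_T)
        # Gradient of sum_{ijkl} Cv[i,j] Ca[k,l] T[i,k] T[j,l] (Cv, Ca symmetric).
        grad_gw = 2.0 * Cv @ T @ Ca
        # Gradient of the KL penalty on the action marginal.
        grad_kl = lam * np.log((T.sum(axis=0) + 1e-30) / q)
        grad = alpha * grad_gw + (1.0 - alpha) * cost + grad_kl[None, :]
        # Mirror (multiplicative) update, carried out in the log domain.
        log_T = log_T - step * grad
        # KL projection onto {T : T 1 = p}: rescale each row to mass p_i.
        row_max = log_T.max(axis=1, keepdims=True)
        log_norm = row_max + np.log(np.exp(log_T - row_max).sum(axis=1, keepdims=True))
        log_T = log_T - log_norm + log_p[:, None]

    T = np.exp(log_T)
    return T, T.argmax(axis=1)


# Synthetic example: three actions of 20 frames each, with a noisy cost matrix.
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1, 2], 20)
noisy_cost = 1.0 - np.eye(3)[true_labels] + 0.5 * rng.standard_normal((60, 3))
plan, labels = decode_segmentation(noisy_cost)
```

In this sketch the GW term penalizes nearby frames that are assigned to different actions, which encourages temporal consistency without fixing an action order. Setting `alpha=0` drops that term and reduces the decoding to a per-frame choice driven by the ground cost alone.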

Paper Structure (45 sections, 8 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Fully Supervised Action Segmentation.
Unsupervised Action Segmentation.
Optimal Transport for Structured Prediction.
Optimal Transport on Structured Data
Preliminaries.
Kantorovich Optimal Transport
Gromov-Wasserstein Optimal Transport
Fused GW Optimal Transport
Unbalanced Optimal Transport
Action Segmentation Optimal Transport
Optimal Transport for Action Segmentation
Objective Function Formulation
Visual Information.
...and 30 more sections

Figures (8)

Figure 1: High-level overview of our action segmentation method, ASOT. Given a (noisy) cost/affinity matrix between video frames and actions, ASOT solves an optimal transport (OT) problem to yield temporally consistent segmentations.
Figure 2: A raw frame/action affinity matrix in a) is decoded using ASOT into a temporally consistent segmentation in b). Removing GW from b) destroys temporal consistency, shown in c). Forcing a balanced assignment to actions in b) yields temporally consistent, but unintuitive results, shown in d).
Figure 3: The unsupervised training pipeline. Orange are learnable parameters, and arrows indicate computation/gradient flow.
Figure 4: Example action segmentations for videos in the same activity category with differing action orderings. Different colors correspond to different actions. Videos of complex, multi-stage activities can exhibit markedly different action orderings.
Figure 5: Sensitivity analysis on FUGW OT hyperparameters.
...and 3 more figures
