Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
Ming Xu, Stephen Gould
TL;DR
This work tackles unsupervised action segmentation in long, untrimmed videos with ASOT, a post-processing step that decodes temporally coherent segmentations from noisy frame-to-action affinities. ASOT solves a fused, unbalanced Gromov-Wasserstein (GW) optimal transport problem that combines a ground cost for visual similarity with a structure-aware GW prior. The prior enforces temporal consistency while still allowing actions to appear in varying orders and with unequal frequencies. The approach yields state-of-the-art results on Breakfast, 50-Salads, YouTube Instructions, and Desktop Assembly in the unsupervised setting, and it also improves supervised methods when applied as a post-processing step. The problem is solved efficiently on GPUs with a projected mirror descent solver, and the decoded segmentations serve as pseudo-labels for self-training, which lets the method scale to large video collections.
Abstract
We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
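To make the decoding step concrete, the following is a minimal NumPy sketch of projected mirror descent on a fused Gromov-Wasserstein style objective. It is an illustration under stated assumptions, not the authors' implementation: the function name `decode_segmentation`, the parameters `radius`, `alpha`, `step` and `n_iters`, the banded frame structure matrix, and the "different action" structure matrix are all hypothetical choices made for this example. The action marginal is left completely free (a simplification of the unbalanced formulation), and each frame carries unit mass, which is a rescaled uniform frame marginal.

```python
import numpy as np


def decode_segmentation(cost, radius=2, alpha=0.5, step=1.0, n_iters=20):
    """Decode frame labels from a (n_frames, n_actions) matching cost matrix.

    Illustrative sketch only. Assumed objective:
        alpha * sum_{i,k,j,l} Cv[i,k] * Ca[j,l] * T[i,j] * T[k,l]
        + (1 - alpha) * <cost, T>
    with every row of T constrained to sum to 1 and no constraint on columns.
    """
    n_frames, n_actions = cost.shape

    # Assumed frame structure: 1 for distinct frames within `radius` steps.
    idx = np.arange(n_frames)
    dist = np.abs(idx[:, None] - idx[None, :])
    frame_struct = ((dist >= 1) & (dist <= radius)).astype(float)

    # Assumed action structure: 1 whenever two actions differ.
    action_struct = 1.0 - np.eye(n_actions)

    # Keep the coupling in the log domain for numerical stability.
    log_t = np.full((n_frames, n_actions), -np.log(n_actions))

    for _ in range(n_iters):
        coupling = np.exp(log_t)
        # Gradient of the quadratic term (both structure matrices are symmetric).
        grad_gw = 2.0 * frame_struct @ coupling @ action_struct
        grad = alpha * grad_gw + (1.0 - alpha) * cost
        # Mirror step under the KL geometry: multiplicative update.
        log_t = log_t - step * grad
        # KL projection onto the row constraint: renormalise each row.
        row_max = log_t.max(axis=1, keepdims=True)
        log_norm = row_max + np.log(np.exp(log_t - row_max).sum(axis=1, keepdims=True))
        log_t = log_t - log_norm

    coupling = np.exp(log_t)
    return coupling.argmax(axis=1), coupling


# Toy example: two actions, six frames each, with one noisy frame (index 3).
cost = np.full((12, 2), 0.8)
cost[:6, 0] = 0.2
cost[6:, 1] = 0.2
cost[3] = [0.6, 0.4]  # noise: frame 3 slightly prefers the wrong action

labels, coupling = decode_segmentation(cost)
print("per-frame argmin:", cost.argmin(axis=1))
print("decoded labels:  ", labels)
```

In this toy setup the per-frame argmin assigns frame 3 to the wrong action, while the structure term should pull it back toward its neighbours so that the decoded labels form two contiguous segments. Setting `alpha=0` removes the structure term and reduces the procedure to a per-frame argmin of the cost. No action order is supplied anywhere, which mirrors the property highlighted in the abstract.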
