Joint Self-Supervised Video Alignment and Action Segmentation

Ali Shah Ali; Syed Ahmed Mahmood; Mubin Saeed; Andrey Konin; M. Zeeshan Zia; Quoc-Huy Tran

TL;DR

The paper tackles temporal video alignment and action segmentation under a self-supervised setting. It introduces VAOT, a fused Gromov-Wasserstein OT approach with structural priors for robust, global video alignment, and extends it to VASOT, a unified OT framework for joint alignment and segmentation in a single model. The methods handle order variations, background frames, and repeated actions, achieving state-of-the-art or strong results on multiple benchmarks while reducing training time and memory compared to separate models. The work is reportedly the first to unify video alignment and action segmentation in a single model, with code released publicly.

Abstract

We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations to solve the optimal transport problem. Our single-task method achieves state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing only a single model and reduces both time and memory consumption compared to two separate single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment results yet superior action segmentation results relative to previous methods in video alignment and action segmentation, respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model. Our code is available on our research website: https://retrocausal.ai/research/.
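To make the contrast between a Kantorovich and a fused Gromov-Wasserstein objective concrete, the following is a minimal sketch of the generic textbook fused Gromov-Wasserstein cost for a given coupling between the frames of two videos. It is not the formulation used in the paper: the function name `fgw_objective`, the weight `alpha`, the use of normalized temporal index gaps as the structure matrices `Dx` and `Dy`, and the toy 1-D features are all assumptions made for illustration, and the structural prior mentioned above is not modeled.

```python
def fgw_objective(C, Dx, Dy, T, alpha=0.5):
    """Generic fused Gromov-Wasserstein cost of a coupling T (illustrative only).

    C  : n x m cross-video feature cost (the Kantorovich part)
    Dx : n x n intra-video structure matrix for video X
    Dy : m x m intra-video structure matrix for video Y
    T  : n x m coupling (non-negative, rows/columns sum to the marginals)
    alpha : hypothetical trade-off between the two terms
    """
    n, m = len(C), len(C[0])
    # Linear (Kantorovich) term: how costly is it to match frame i to frame j?
    linear = sum(C[i][j] * T[i][j] for i in range(n) for j in range(m))
    # Quadratic (Gromov-Wasserstein) term: do matched pairs preserve structure?
    quadratic = sum(
        (Dx[i][k] - Dy[j][l]) ** 2 * T[i][j] * T[k][l]
        for i in range(n) for j in range(m)
        for k in range(n) for l in range(m)
    )
    return (1 - alpha) * linear + alpha * quadratic


# Toy example: two 3-frame "videos" with identical 1-D frame features.
x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 2.0]
n, m = len(x), len(y)

C = [[abs(a - b) for b in y] for a in x]
# Assumption: structure is measured by normalized temporal index gaps.
Dx = [[abs(i - k) / n for k in range(n)] for i in range(n)]
Dy = [[abs(j - l) / m for l in range(m)] for j in range(m)]

# A temporally consistent (diagonal) coupling versus an uninformative uniform one.
diagonal = [[1.0 / n if i == j else 0.0 for j in range(m)] for i in range(n)]
uniform = [[1.0 / (n * m) for _ in range(m)] for _ in range(n)]

cost_diagonal = fgw_objective(C, Dx, Dy, diagonal)
cost_uniform = fgw_objective(C, Dx, Dy, uniform)
```

In this toy setting the diagonal coupling incurs zero cost in both terms, while the uniform coupling is penalized by both the feature mismatch and the distorted temporal structure. The brute-force quadratic term is written for readability and scales as O(n²m²), so it is only suitable for very short sequences.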
