Joint Self-Supervised Video Alignment and Action Segmentation

Ali Shah Ali, Syed Ahmed Mahmood, Mubin Saeed, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

TL;DR

The paper tackles temporal video alignment and action segmentation in a self-supervised setting. It introduces VAOT, a fused Gromov-Wasserstein optimal transport (OT) approach with structural priors for robust, global video alignment, and extends it to VASOT, a unified OT framework that performs alignment and segmentation jointly in a single model. The methods handle order variations, background frames, and repeated actions, achieving state-of-the-art or strong results on multiple benchmarks while reducing training time and memory compared to separate models. The authors state that, to the best of their knowledge, this is the first work to unify video alignment and action segmentation in a single model; the code is publicly released.

Abstract

We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations to solve the optimal transport problem. Our single-task method achieves state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing only a single model and reduces both time and memory consumption compared to two separate single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment results and superior action segmentation results relative to previous methods for the respective tasks. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model. Our code is available on our research website: https://retrocausal.ai/research/.
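
To make the idea of a fused Gromov-Wasserstein coupling between two videos concrete, the sketch below solves a generic entropic fused Gromov-Wasserstein problem with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`cosine_cost`, `temporal_prior`, `sinkhorn`, `fused_gw`), the square-loss structure term, the uniform frame weights, the normalized temporal-distance prior, and all hyperparameter values are hypothetical choices made for this example. The abstract does not specify the paper's actual priors or solver.

```python
import numpy as np

def cosine_cost(X, Y):
    """Feature cost between frames: 1 - cosine similarity (assumed choice)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def temporal_prior(n):
    """Hypothetical structural prior: normalized temporal distance |i - k|."""
    t = np.arange(n) / max(n - 1, 1)
    return np.abs(t[:, None] - t[None, :])

def sinkhorn(cost, p, q, eps, n_iters=200):
    """Entropic Kantorovich OT via Sinkhorn scaling iterations."""
    K = np.exp(-(cost - cost.min()) / eps)  # shift the cost for stability
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def fused_gw(M, Cx, Cy, alpha=0.5, eps=0.05, n_outer=10):
    """Entropic fused GW with square loss (generic sketch).

    Each outer step linearizes the quadratic structure term around the
    current coupling T, then solves the resulting entropic OT problem.
    """
    n, m = M.shape
    p = np.full(n, 1.0 / n)  # uniform weights over frames of video x
    q = np.full(m, 1.0 / m)  # uniform weights over frames of video y
    T = np.outer(p, q)
    # Part of the square-loss gradient that does not depend on T.
    const = (Cx ** 2 @ p)[:, None] + (Cy ** 2 @ q)[None, :]
    for _ in range(n_outer):
        gw_grad = 2.0 * (const - 2.0 * Cx @ T @ Cy.T)
        cost = (1.0 - alpha) * M + alpha * gw_grad
        T = sinkhorn(cost, p, q, eps)
    return T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 8))  # synthetic frame embeddings, video x
    Y = X[::2] + 0.05 * rng.normal(size=(10, 8))  # subsampled noisy copy
    T = fused_gw(cosine_cost(X, Y), temporal_prior(20), temporal_prior(10))
    print(T.shape, T.sum())
```

In this sketch, `alpha` trades off the feature-matching term against the structure-matching term, and `n_outer` is kept small to mirror the abstract's remark that only a few iterations are needed; both values are illustrative rather than taken from the paper.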

Paper Structure

This paper contains 23 sections, 7 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: (a) Our self-supervised video alignment method (VAOT) based on a fused Gromov-Wasserstein optimal transport with structural priors $\{\textbf{C}^x,\textbf{C}^y\}$. (b) Our joint self-supervised video alignment and action segmentation method (VASOT) based on a unified optimal transport with structural priors $\{\textbf{C}^x,\textbf{C}^y\}$ for video alignment and $\{\textbf{C}^x,\textbf{C}^a\}$ and $\{\textbf{C}^y,\textbf{C}^a\}$ for action segmentation.
  • Figure 2: (a) Our self-supervised video alignment method (VAOT). (b) Our joint self-supervised video alignment and action segmentation method (VASOT). Learnable parameters are shown in red. Arrows denote computation/gradient flows (blue and green represent video alignment and action segmentation respectively).
  • Figure 3: Sensitivity analysis results. Note that (a-d) are for VAOT, while (e-f) are for VASOT.
  • Figure 4: Fine-grained frame retrieval results on Penn Action. The query image is on the left, while on the right are the top 5 matching images retrieved by VAOT (blue box) and LAV (red box).
  • Figure 5: Action segmentation results on Breakfast (top) and YouTube Instructions (bottom).
  • ...and 4 more figures