Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

Syed Ahmed Mahmood, Ali Shah Ali, Umer Ahmed, Fawad Javed Fateh, M. Zeeshan Zia, Quoc-Huy Tran

TL;DR

The paper tackles learning the sequence of key steps from unlabeled instructional videos by aligning frames across demonstrations with a fused Gromov-Wasserstein OT (FGWOT) that incorporates a structural prior. To avoid degenerate mappings where all frames collapse to a single cluster, it adds a Contrastive Inverse Difference Moment (C-IDM) regularization, resulting in Regularized Gromov-Wasserstein OT (RGWOT). Empirical results on EgoProceL, ProceL, and CrossTask show RGWOT achieving state-of-the-art performance, outperforming OPEL and other baselines, with ablations highlighting the critical roles of priors, virtual frames, and regularization. The approach yields robust frame-to-frame alignment, reliable key-step localization, and accurate ordering, offering a data-efficient path for procedure learning with potential impact on robotics and instructional video analysis.

Abstract

We study self-supervised procedure learning, which discovers key steps and their order from a set of unlabeled videos. Previous methods typically learn frame-to-frame correspondences between videos before determining key steps and their order, but their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised framework that uses a fused Gromov-Wasserstein optimal transport formulation with a structural prior for frame-to-frame mapping. Optimizing only for this temporal alignment, however, may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and thus every video is assigned to just one key step. To address that issue, we integrate a contrastive regularization, which maps different frames to distinct points in the embedding space, avoiding trivial solutions. Finally, extensive experiments on egocentric and third-person benchmarks demonstrate superior performance over prior work, including OPEL, which relies on a classical Kantorovich optimal transport formulation with an optimality prior.
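To make the alignment idea concrete, the following is a minimal sketch of a generic fused Gromov-Wasserstein cost for a given frame-to-frame coupling. It is not the paper's implementation: the squared-loss form, the normalized temporal distance used as the structure matrix, the weight `alpha`, and all function and variable names are assumptions chosen for illustration, and no solver, structural prior, or contrastive regularization is included.

```python
import numpy as np

def fgw_objective(C, Dx, Dy, T, alpha=0.5):
    """Generic fused Gromov-Wasserstein cost of a coupling T (squared loss).

    C  : (n, m) feature cost between frames of video X and video Y
    Dx : (n, n) intra-video structure matrix for X
    Dy : (m, m) intra-video structure matrix for Y
    T  : (n, m) coupling (soft frame-to-frame assignment)
    alpha is an illustrative trade-off weight, not a value from the paper.
    """
    # Feature (Wasserstein) term: sum_ij C_ij * T_ij
    w_term = np.sum(C * T)
    # Structure (Gromov-Wasserstein) term:
    # sum_ijkl (Dx_ik - Dy_jl)^2 * T_ij * T_kl
    diff = Dx[:, None, :, None] - Dy[None, :, None, :]  # shape (n, m, n, m)
    gw_term = np.einsum("ijkl,ij,kl->", diff ** 2, T, T)
    return (1.0 - alpha) * w_term + alpha * gw_term

# Toy data: two "videos" with identical frame embeddings (hypothetical).
rng = np.random.default_rng(0)
n = 4
X = rng.normal(size=(n, 3))
Y = X.copy()

# Pairwise squared Euclidean feature cost between frames.
C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

# Assumed structure matrices: normalized temporal distance within each video.
idx = np.arange(n)
Dx = np.abs(idx[:, None] - idx[None, :]) / n
Dy = Dx.copy()

# Two candidate couplings with uniform marginals.
T_diag = np.eye(n) / n                    # frame i matched to frame i
T_unif = np.full((n, n), 1.0 / (n * n))   # every frame spread over all frames

cost_diag = fgw_objective(C, Dx, Dy, T_diag)
cost_unif = fgw_objective(C, Dx, Dy, T_unif)
print(cost_diag, cost_unif)
```

On this toy input the one-to-one coupling has zero cost, while the uniform coupling is penalized by both the feature term and the structure term. The cost only scores a given coupling; the collapse described in the abstract arises when the embeddings themselves are trained against such an alignment loss, which is what the contrastive regularization is meant to counteract.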

Paper Structure

This paper contains 18 sections, 7 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: Procedure learning methods usually learn video frame representations via temporal video alignment (a). The learned embeddings are then used for extracting key steps and their order (b). In this work, we rely on a regularized Gromov-Wasserstein optimal transport formulation for tackling order variations, background/redundant frames, and repeated actions in (a), yielding state-of-the-art results in (b).
  • Figure 2: Our approach incorporates a fused Gromov-Wasserstein optimal transport formulation with a structural prior for establishing frame-to-frame correspondences between videos with a contrastive regularization for avoiding degenerate solutions. Forward/backward arrows denote computation/gradient flows. Blue and orange/green represent temporal alignment and contrastive regularization.
  • Figure 3: Degenerate solutions of [ali2025joint] across four different subtasks of ProceL [elhamifar2019unsupervised]. Ground truth and the results of [ali2025joint] are shown in the top and bottom rows, respectively.
  • Figure 4: Qualitative results on ProceL [elhamifar2019unsupervised]. Colored segments represent predicted actions, with a particular color denoting the same action across all models.
  • Figure 5: Effects of training data quantity on F1 score across different methods on MECCANO [ragusa2021meccano].
  • ...and 3 more figures