Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
Syed Ahmed Mahmood, Ali Shah Ali, Umer Ahmed, Fawad Javed Fateh, M. Zeeshan Zia, Quoc-Huy Tran
TL;DR
The paper tackles learning key steps and their order from unlabeled instructional videos by aligning frames across demonstrations with a fused Gromov-Wasserstein optimal transport (FGWOT) that incorporates a structural prior. To avoid degenerate mappings in which all frames collapse to a small cluster in the embedding space, it adds a Contrastive Inverse Difference Moment (C-IDM) regularization, resulting in Regularized Gromov-Wasserstein OT (RGWOT). Empirical results on EgoProceL, ProceL, and CrossTask show RGWOT achieving state-of-the-art performance, outperforming OPEL and other baselines, with ablations highlighting the critical roles of priors, virtual frames, and regularization. The approach requires no step annotations and yields robust frame-to-frame alignment, reliable key-step localization, and accurate step ordering, with potential applications in robotics and instructional video analysis.
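To illustrate why a C-IDM-style term discourages collapse, here is a minimal NumPy sketch modeled on the form of this regularizer used in earlier video-alignment work. The paper's exact formulation may differ. The window `sigma`, the `margin`, the weight `(i - j)^2 + 1`, and the function name `c_idm` are all assumptions made for illustration.

```python
import numpy as np

def c_idm(emb, sigma=2, margin=2.0):
    """Contrastive-IDM-style regularizer for one video (illustrative sketch).

    emb:    (T, D) array of frame embeddings in temporal order.
    sigma:  hypothetical window; frames at most sigma steps apart are "near".
    margin: hypothetical margin that temporally far frames are pushed beyond.
    """
    T = emb.shape[0]
    idx = np.arange(T)
    gap = np.abs(idx[:, None] - idx[None, :])      # temporal distance |i - j|
    weight = gap.astype(float) ** 2 + 1.0          # assumed weight (i - j)^2 + 1
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    near = gap <= sigma
    # Temporally near frames are pulled together in embedding space.
    pull = np.where(near, dist / weight, 0.0)
    # Temporally far frames are penalized if closer than the margin.
    push = np.where(~near, weight * np.maximum(0.0, margin - dist), 0.0)
    return float((pull + push).sum())

# Toy check: collapsed embeddings versus embeddings that follow time.
T = 8
collapsed = np.zeros((T, 3))
spread = np.arange(T, dtype=float)[:, None] * np.ones((1, 3))
loss_collapsed = c_idm(collapsed)
loss_spread = c_idm(spread)
```

In this toy setting the collapsed embedding incurs the larger penalty, because every temporally distant pair of frames violates the margin.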
Abstract
We study self-supervised procedure learning, which discovers key steps and their order from a set of unlabeled videos. Previous methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised framework that uses a fused Gromov-Wasserstein optimal transport with a structural prior for frame-to-frame mapping. Optimizing only for this temporal alignment, however, may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and thus every video is assigned to just one key step. To address this issue, we integrate a contrastive regularization, which maps different frames to different points in the embedding space, avoiding trivial solutions. Finally, extensive experiments on egocentric and third-person benchmarks demonstrate that our approach outperforms prior works, including OPEL, which relies on a classical Kantorovich optimal transport with an optimality prior.
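As a rough illustration of the fused Gromov-Wasserstein idea, the sketch below evaluates the standard FGW objective for a given coupling between the frames of two videos. It is not the paper's implementation: the priors and virtual frames are not modeled, and the names `fgw_cost`, `M`, `Cx`, `Cy`, and `alpha`, as well as the temporal structure matrices, are assumptions chosen for illustration.

```python
import numpy as np

def fgw_cost(M, Cx, Cy, T, alpha=0.5):
    """Standard fused Gromov-Wasserstein objective for a coupling T.

    M:     (n, m) feature cost between frames of video X and video Y.
    Cx:    (n, n) intra-video structure matrix for X.
    Cy:    (m, m) intra-video structure matrix for Y.
    T:     (n, m) coupling (soft frame-to-frame assignment).
    alpha: hypothetical trade-off between feature and structure terms.
    """
    # Wasserstein (feature) term: sum_ij M[i, j] * T[i, j]
    feature_term = np.sum(M * T)
    # diff[i, j, k, l] = Cx[i, k] - Cy[j, l]
    diff = Cx[:, None, :, None] - Cy[None, :, None, :]
    # Gromov-Wasserstein (structure) term with squared loss.
    structure_term = np.einsum("ijkl,ij,kl->", diff ** 2, T, T)
    return (1.0 - alpha) * feature_term + alpha * structure_term

# Toy example: two "videos" with identical frame embeddings.
rng = np.random.default_rng(0)
n = 6
X = rng.normal(size=(n, 4))
Y = X.copy()
M = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
steps = np.arange(n) / n
Cx = np.abs(steps[:, None] - steps[None, :])          # assumed temporal structure
Cy = Cx.copy()

T_diag = np.eye(n) / n                 # frame i matched to frame i
T_unif = np.full((n, n), 1.0 / n**2)   # every frame spread over all frames

cost_diag = fgw_cost(M, Cx, Cy, T_diag)
cost_unif = fgw_cost(M, Cx, Cy, T_unif)
```

In this toy case the temporally consistent coupling `T_diag` has a lower cost than the uniform coupling `T_unif`, since it matches both the frame features and the intra-video structure. The dense four-index tensor is only practical for short sequences.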
