A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
TL;DR
This work shows that mainstream self-supervised visual representations largely ignore procedural order in video data. It introduces PL-Stitch, a two-branch framework trained on a shared ViT backbone. PL-Stitch uses Plackett-Luce listwise ranking both to learn global procedural progression from frame sequences and to capture fine-grained cross-frame cues, the latter via a spatio-temporal jigsaw and masked image modeling. The approach achieves state-of-the-art results on five procedural benchmarks, with substantial gains in surgical phase recognition and cooking action segmentation, demonstrating the practical value of explicitly modeling temporal order. Extensive ablations and qualitative analyses indicate that the learned representations are robust, interpretable, and aligned with real procedural structure. Future directions include action anticipation and multi-modal alignment with textual procedural guides.
Abstract
Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured: they consist of actions performed in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning (SSL) methods often overlook the procedural nature that underpins such activities. We expose this lack of procedural awareness with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the first by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.
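To make the primary objective concrete, the Plackett-Luce likelihood of a chronological ordering can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `logsumexp` and `plackett_luce_nll`, the use of one scalar score per sampled frame, and the convention that earlier frames should receive higher scores are all hypothetical choices made for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def plackett_luce_nll(scores, order):
    """Negative log-likelihood of `order` under the Plackett-Luce model.

    scores: one scalar per sampled frame (assumed to come from the model).
    order:  frame indices listed from earliest to latest (the target ranking).

    The PL model picks items one at a time: at step k, the k-th item is chosen
    with softmax probability over the items that have not been picked yet.
    """
    ranked = [scores[i] for i in order]
    nll = 0.0
    for k in range(len(ranked)):
        # log P(pick ranked[k] | remaining items ranked[k:])
        nll -= ranked[k] - logsumexp(ranked[k:])
    return nll


if __name__ == "__main__":
    # Hypothetical scores for three frames; frame 0 is the earliest.
    scores = [2.0, 1.0, 0.0]
    print(plackett_luce_nll(scores, [0, 1, 2]))  # chronological order
    print(plackett_luce_nll(scores, [2, 1, 0]))  # time-reversed order
```

Under this sketch, scores that agree with the chronological order give a lower loss than the same scores evaluated against the reversed order, and uninformative (equal) scores give a loss of log(n!) for n frames, which corresponds to a uniform distribution over all orderings.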
