A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking

Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

TL;DR

This work shows that mainstream self-supervised visual representations largely ignore procedural order in video data. It introduces PL-Stitch, a two-branch framework trained on a shared ViT backbone: one branch uses Plackett-Luce listwise ranking to learn global procedural progression from frame sequences, and the other captures fine-grained cross-frame cues via a spatio-temporal jigsaw combined with masked image modeling. The approach achieves state-of-the-art results on five procedural benchmarks, with substantial gains in surgical phase recognition and cooking action segmentation, demonstrating the practical value of explicitly modeling temporal order. The work also provides extensive ablations and qualitative analyses indicating that the learned representations are robust, interpretable, and aligned with real procedural structure. Future directions include action anticipation and multi-modal alignment with textual procedural guides.

Abstract

Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured as a set of actions conducted in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning methods often overlook the procedural nature that underpins such activities. We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.
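To make the primary objective concrete, the following is a minimal sketch of a Plackett-Luce negative log-likelihood for a chronological frame order. It assumes the standard softmax-style parametrization, in which each frame receives one real-valued score; the function name `plackett_luce_nll` and the toy scores are hypothetical and are not taken from the paper's implementation.

```python
import math


def plackett_luce_nll(scores, order):
    """Negative log-likelihood of `order` under a Plackett-Luce model.

    scores: one real-valued score per frame (assumed to act as a log-worth).
    order:  frame indices listed from earliest to latest (the target ranking).
    """
    # Rearrange the scores so that position i holds the score of the i-th frame in time.
    ranked = [scores[i] for i in order]
    nll = 0.0
    for i in range(len(ranked)):
        # At step i the model picks among the frames that have not been placed yet.
        tail = ranked[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(x - m) for x in tail))
        nll += log_norm - ranked[i]
    return nll


# Toy example: three frames whose true chronological order is 0, 1, 2.
# Hypothetical scores that decrease with time agree with that order.
scores = [2.0, 1.0, 0.0]
loss_correct = plackett_luce_nll(scores, [0, 1, 2])
loss_reversed = plackett_luce_nll(scores, [2, 1, 0])
print(loss_correct < loss_reversed)
```

With uninformative (all-equal) scores, every one of the k! orderings is equally likely, so the loss equals log(k!). Minimizing the loss therefore pushes the scores to separate frames according to their position in the workflow.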

Paper Structure

This paper contains 31 sections, 9 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 2: Core concept of the PL-Stitch model and key results. (a) Our model, PL-Stitch, learns a procedurally-aware representation by optimizing for the negative log likelihood of the Plackett-Luce distribution, $-\log P(r^*|s)$, which maximizes the probability ($P$) of the ground-truth temporal order ($r^*$) given the model's predicted parameters ($s$). (b) Significant performance gains are shown for the Cholec80 phase recognition task.
  • Figure 3: Overview of the PL-Stitch framework. Our model jointly trains a shared backbone encoder ($f_{\theta}$) by using a Video branch (Sec. \ref{sec:video_branch}) for global workflow progression and an Image branch (Sec. \ref{sec:image_branch}) for fine-grained feature learning. The Video branch (top) treats time as order, training the encoder with a Plackett-Luce loss $\mathcal{L}_{\text{vid}}$ (Eq. \ref{eq:pl_loss}, Eq. \ref{eq:vid_loss}) to predict the correct relative chronological sequence of a sampled clip. The Image branch (bottom) learns robust local features by jointly optimizing a standard masked image modeling loss $\mathcal{L}_{\text{MIM}}$ with our novel spatio-temporal jigsaw $\mathcal{L}_{\text{jigsaw}}$, which learns object correspondence from adjacent frames (Eq. \ref{eq:jigsaw_loss}). The symbols $h_{\text{vid}}$, $h_{\text{MIM}}$, and $h_{\text{jigsaw}}$ denote task-specific projection heads. By optimizing all objectives, the shared backbone learns a powerful representation sensitive to both procedural order and fine-grained visual details. Best viewed online.
  • Figure 4: Structure of our $h_{\text{vid}}$ and $h_{\text{jigsaw}}$ heads. The video head ($h_{\text{vid}}$) consists of an MLP that reduces feature dimensionality for computational efficiency, a Transformer Encoder that aggregates global context across the $k$ frame features for ordering, and a final MLP that outputs the PL distribution parameters $s_{\text{clip}}$. The jigsaw head ($h_{\text{jigsaw}}$) uses Cross-Attention to aggregate temporal context (K, V) onto the target patches (Q), followed by Self-Attention to refine spatial relationships, and a final MLP that produces the PL distribution parameters $s_{\text{jigsaw}}$.
  • Figure 5: Visualization of linear probing predictions for phase recognition (top) and action segmentation (bottom). Each horizontal bar shows the frame-wise predictions, where the x-axis denotes progress over time and the colors represent different classes.
  • Figure 6: t-SNE visualization of frozen backbone features for Cholec80 phase recognition. The plot also reports the clustering quality metrics Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), where higher values are better. Our PL-Stitch model demonstrates superior class separation (ARI: 0.3536, NMI: 0.4546) compared to VideoMAEv2, DINO, and iBOT.
  • ...and 5 more figures
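
For reference, the quantity $-\log P(r^*|s)$ in the Figure 2 caption can be written out under the standard Plackett-Luce parametrization. This is a sketch that assumes the predicted parameters $s$ enter through an exponential (softmax-style) link and that a clip contains $k$ frames, with $r^*_i$ denoting the index of the frame at position $i$ of the ground-truth order; the paper's own equations may differ in notation or normalization.

```latex
P(r^* \mid s) = \prod_{i=1}^{k} \frac{\exp\big(s_{r^*_i}\big)}{\sum_{j=i}^{k} \exp\big(s_{r^*_j}\big)},
\qquad
-\log P(r^* \mid s) = \sum_{i=1}^{k} \Big[ \log \sum_{j=i}^{k} \exp\big(s_{r^*_j}\big) - s_{r^*_i} \Big]
```

Each factor is the probability of selecting the next frame in time from the frames that have not yet been placed, so the product is the probability of reproducing the full chronological sequence.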