Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
Charig Yang, Weidi Xie, Andrew Zisserman
TL;DR
This work tackles the problem of discovering and localizing monotonic temporal changes in image sequences by framing it as a self-supervised video ordering task. It introduces a transformer-based ordering model with built-in attribution that identifies which regions drive the time-correlated changes and predicts the temporal order of the frames without labeled data. The approach localizes and segments monotonic changes across diverse domains (satellite imagery, medical MRI, and natural scenes) and achieves state-of-the-art performance on standard image-ordering benchmarks. The attribution maps also serve as effective prompts for segmentation, enabling downstream use without supervision and supporting zero-shot analysis of unseen sequences.
Abstract
Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as the supervisory signal, since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a transformer-based model with built-in attribution maps for ordering image sequences of arbitrary length. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple domains covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves state-of-the-art performance on standard benchmarks for image ordering.
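
The proxy task described above can be illustrated with a minimal sketch of how a self-supervised ordering example might be built from a temporally ordered sequence. This is not the authors' implementation: the helper names `make_ordering_example` and `restore_order` are hypothetical, and strings stand in for image frames. The point is only that the target comes from the original frame order, so no manual labels are needed.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_ordering_example(
    frames: Sequence[T], rng: random.Random
) -> Tuple[List[T], List[int]]:
    """Shuffle an ordered sequence (hypothetical helper, for illustration).

    Returns (shuffled, target), where target[i] is the original time
    index of shuffled[i]. The target is derived from time alone.
    """
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[j] for j in order]
    return shuffled, order


def restore_order(shuffled: Sequence[T], predicted: Sequence[int]) -> List[T]:
    """Sort the shuffled frames by their predicted time index."""
    ranked = sorted(zip(predicted, range(len(shuffled))))
    return [shuffled[i] for _, i in ranked]


# Placeholder "frames"; a real pipeline would use image tensors.
frames = ["t0", "t1", "t2", "t3", "t4"]
shuffled, target = make_ordering_example(frames, random.Random(0))

# A perfect ordering model would predict `target` from `shuffled` alone.
recovered = restore_order(shuffled, target)
```

In this sketch, a model trained to predict `target` from `shuffled` can only succeed by relying on cues that change monotonically with time. A cyclic or stochastic change does not determine a frame's position in the sequence, which is why such changes carry no useful signal for this task.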
