Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Charig Yang, Weidi Xie, Andrew Zisserman

TL;DR

This work tackles the problem of discovering and localizing monotonic temporal changes in image sequences by framing it as a self-supervised video ordering task. It introduces a transformer-based ordering model with built-in attribution that identifies which regions drive the time-correlated changes and outputs an ordinal sequence without labeled data. The approach demonstrates robust localization and segmentation of monotonic changes across diverse domains (satellite imagery, medical MRI, and natural scenes) and achieves state-of-the-art performance on standard image-ordering benchmarks. The attribution maps also serve as effective prompts for segmentation, enabling practical downstream use without supervision and supporting zero-shot analysis of unseen sequences.

Abstract

Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal, since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a transformer-based model for ordering image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple domains covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves state-of-the-art performance on standard benchmarks for image ordering.
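As a rough illustration of the proxy task described in the abstract, the sketch below builds one training example by shuffling a temporally ordered sequence and keeping the permutation as the target. The function name `make_ordering_example` and the use of NumPy are assumptions made for illustration; the authors' data pipeline is not described here.

```python
import numpy as np

def make_ordering_example(frames, rng):
    """Shuffle an ordered frame sequence and return (shuffled, target).

    Hypothetical helper: `frames` is any list already sorted by time.
    target[k] is the original temporal position of shuffled[k], so the
    only supervision is the order the frames were captured in.
    """
    perm = rng.permutation(len(frames))
    shuffled = [frames[i] for i in perm]
    return shuffled, perm

# Usage: strings stand in for images.
frames = ["t0", "t1", "t2", "t3"]
shuffled, target = make_ordering_example(frames, np.random.default_rng(0))
# Sorting the shuffled frames by their target positions restores the sequence.
restored = [shuffled[k] for k in np.argsort(target)]
```

No manual labels are involved: the target comes for free from the capture order, which is why only cues that change monotonically with time can help a model solve the task.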

Paper Structure (36 sections, 1 equation, 12 figures, 14 tables)

Figures (12)

  • Figure 1: Localizing monotonic temporal changes. Top: satellite images (ordered left to right) taken months apart, containing several changes -- some are monotonic (e.g. urbanization), while others are seasonal/cyclic (e.g. water level). Bottom: Our model's attribution map prediction on the sequence is able to localize the regions with monotonic temporal changes (in green), while being invariant to the seasonal and sporadic changes. The model is trained with no manual supervision, generalises to unseen sequences (as here), and the attribution map can also be used as a prompt to obtain segmentation.
  • Figure 2: Network architecture. For an unordered sequence of $F$ frames each with $N$ patches, the transformer encoder takes in all $FN$ patches as input, and outputs $FN$ features. The transformer decoder takes in $Q$ learnable queries, each corresponding to an ordinal position, and the encoder output for cross-attention, resulting in $Q$ output features. An $FN \times Q$ cosine similarity matrix is constructed between all pairs of features from the encoder and decoder outputs, and spatial max-pooling over this matrix yields the $F \times Q$ order predictions. The ordering can then be obtained simply by taking an argmax along each query axis. In the example sequence, the hour hand is correlated monotonically with time, and appears in the attribution map.
  • Figure 3: Sequence datasets. From left to right: dynamic Random Dot Stereograms (RDS) (moving dots colored only for illustration), moving camouflaged animals (MoCA), timelapse clocks (cropped/full), timelapse scenes, MUDS, CalFire, OASIS-3.
  • Figure 4: (a) To evaluate localization and segmentation performance, we manually annotate the monotonically changing regions (shown in yellow) on the MUDS test set. Each sequence contains four frames, and the monotonic changes between the first and last frames are annotated. (b) 4-digit MNIST (LeCun et al., 1998) (left) and SVHN (Netzer et al., 2011) (right). The task is to order the images by the numbers they contain in increasing order (top to bottom).
  • Figure 5: Ordering and Localization results across various datasets, where the model is able to discover and localize various cues across different domains, including object motion, clocks, scenery, landscape and biological aging. The left column shows the input (unordered) images. Each column of the similarity matrix represents the model's prediction of each individual order (0, 1, 2, 3), where the image in the red box is chosen, and the attention heat map within the box localizes the change.
  • ...and 7 more figures
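
The readout described in the Figure 2 caption (cosine similarity, spatial max-pooling, argmax per query) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `order_readout`, the feature dimension `D`, and the frame-major layout of encoder features (all $N$ patches of frame 0 first, then frame 1, and so on) are assumptions made for illustration.

```python
import numpy as np

def order_readout(enc_feats, dec_feats, num_frames):
    """Turn encoder/decoder features into order predictions.

    enc_feats: (F*N, D) encoder outputs, assumed frame-major.
    dec_feats: (Q, D) decoder outputs, one per ordinal position.
    Returns the per-patch similarities (F, N, Q), the pooled order
    scores (F, Q), and the chosen frame index for each query (Q,).
    """
    n_patches = enc_feats.shape[0] // num_frames
    # L2-normalize so the dot product is a cosine similarity.
    enc = enc_feats / np.linalg.norm(enc_feats, axis=1, keepdims=True)
    dec = dec_feats / np.linalg.norm(dec_feats, axis=1, keepdims=True)
    sim = enc @ dec.T                                # (F*N, Q)
    sim = sim.reshape(num_frames, n_patches, -1)     # (F, N, Q)
    order_scores = sim.max(axis=1)                   # max over patches
    frame_for_query = order_scores.argmax(axis=0)    # argmax over frames
    return sim, order_scores, frame_for_query

# Usage with toy features: F=3 frames, N=2 patches, Q=3 queries, D=3.
dec = np.eye(3)
enc = np.array([
    [0.0, 0.0, 2.0], [-1.0, -1.0, -1.0],   # frame 0 matches query 2
    [3.0, 0.0, 0.0], [-1.0, -1.0, -1.0],   # frame 1 matches query 0
    [0.0, 1.0, 0.0], [-1.0, -1.0, -1.0],   # frame 2 matches query 1
])
sim, scores, order = order_readout(enc, dec, num_frames=3)
```

In this toy case `order` places frame 1 first, frame 2 second, and frame 0 last. The pre-pooling array `sim[f, :, q]` is kept because it holds one score per patch for a given frame and query; treating it as the per-patch map behind the attribution visualizations is an assumption here.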