EMPLACE: Self-Supervised Urban Scene Change Detection

Tim Alpherts, Sennay Ghebreab, Nanne van Noord

TL;DR

EMPLACE addresses urban scene change detection (USCD) by introducing AC-1M, a large-scale tri-temporal panorama dataset, and a self-supervised framework that learns change representations without labels. The method uses a Vision Transformer (ViT-B/14) initialized with DINOv2 and trained with an adaptive triplet loss, complemented by cut-and-flip augmentation to handle panoramic imagery and noise. EMPLACE performs strongly both as a pre-training method and in zero-shot change detection, outperforming state-of-the-art baselines, and its Amsterdam case study shows that small visual changes correlate with housing prices while large changes exhibit distinct patterns. Collectively, the work enables scalable, label-free USCD across cities and provides tools to link visual urban change to socio-economic indicators, advancing Visual Urban Analytics beyond static, single-image analyses.
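The Figure 3 caption states that the $cls$ tokens of an image triplet, together with a margin $\alpha$ computed from the image dates, feed the adaptive triplet loss, but this summary does not give the exact formula. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: distances are cosine distances between $cls$ tokens, the positive is the image closer in time to the anchor, and the margin grows linearly with the difference between the two time gaps (`base` and `scale_days` are hypothetical parameters). Plain Python lists stand in for the ViT-B/14 embeddings.

```python
import math
from datetime import date


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def adaptive_margin(t_anchor, t_pos, t_neg, base=0.1, scale_days=365.0):
    """Hypothetical date-based margin: grows with how much farther in time
    the negative is from the anchor than the positive is."""
    gap_pos = abs((t_pos - t_anchor).days)
    gap_neg = abs((t_neg - t_anchor).days)
    return base * max(0.0, gap_neg - gap_pos) / scale_days


def adaptive_triplet_loss(cls_a, cls_p, cls_n, t_a, t_p, t_n):
    """Triplet hinge loss whose fixed margin is replaced by the
    hypothetical adaptive_margin above."""
    alpha = adaptive_margin(t_a, t_p, t_n)
    d_pos = cosine_distance(cls_a, cls_p)
    d_neg = cosine_distance(cls_a, cls_n)
    return max(0.0, d_pos - d_neg + alpha)


# Dates from the Figure 2 cluster, read as day-month-year.
t_2016, t_2018, t_2022 = date(2016, 8, 8), date(2018, 7, 30), date(2022, 5, 12)
alpha = adaptive_margin(t_2016, t_2018, t_2022)
```

With the 2016 image as anchor, the gaps are 721 and 2103 days, so this hypothetical margin evaluates to 0.1 × 1382 / 365 ≈ 0.379. A larger margin asks the model to push temporally distant images further apart, which is one plausible motivation for tying $\alpha$ to the dates.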

Abstract

Urban change is a constant process that influences the perception of neighbourhoods and the lives of the people within them. The field of Urban Scene Change Detection (USCD) aims to capture changes in street scenes using computer vision, and can help raise awareness of these changes, making it possible to better understand the city and its residents. Traditionally, USCD has relied on supervised methods trained on small-scale datasets. This limits their application to new cities, as each new city requires a labour-intensive labelling process and forces a priori definitions of relevant change. In this paper we introduce AC-1M, by far the largest USCD dataset with over 1.1M images, together with EMPLACE, a self-supervised method that trains a Vision Transformer using our adaptive triplet loss. We show that EMPLACE outperforms state-of-the-art (SOTA) methods both as a pre-training method for linear fine-tuning and in a zero-shot setting. Lastly, in a case study of Amsterdam, we show that we are able to detect both small and large changes throughout the city, and that the changes uncovered by EMPLACE, depending on their size, correlate with housing prices, which in turn is indicative of inequity.
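In the zero-shot setting, the trained backbone is used without fine-tuning. One plausible way to turn its embeddings into a change signal, assumed here rather than taken from the paper, is to score each location by the cosine distance between the $cls$ tokens of two captures and rank locations by that score. The function names, location IDs, and toy embeddings below are hypothetical.

```python
import math


def change_score(cls_before, cls_after):
    """Hypothetical zero-shot change score: cosine distance between the cls
    embeddings of two captures of the same location (assumes non-zero vectors)."""
    dot = sum(a * b for a, b in zip(cls_before, cls_after))
    norm = math.sqrt(sum(a * a for a in cls_before)) * math.sqrt(sum(b * b for b in cls_after))
    return 1.0 - dot / norm


def rank_locations(pairs):
    """Sort (location_id, cls_before, cls_after) tuples by descending change score."""
    scored = [(loc, change_score(before, after)) for loc, before, after in pairs]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-D embeddings standing in for ViT cls tokens.
pairs = [
    ("A", [1.0, 0.0], [1.0, 0.0]),  # identical embeddings: no change
    ("B", [1.0, 0.0], [0.0, 1.0]),  # orthogonal embeddings: large change
    ("C", [1.0, 0.0], [1.0, 1.0]),  # partial overlap: moderate change
]
ranked = rank_locations(pairs)
```

Here `ranked` orders the locations B, C, A. Separating small from large changes, as in the Amsterdam case study, would additionally require a threshold or binning on this score; this summary does not state how EMPLACE does so.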

Paper Structure

This paper contains 21 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Examples of various urban changes in Amsterdam uncovered through EMPLACE.
  • Figure 2: Example of a cluster from AC-1M. From top to bottom, images taken on 08-08-2016, 30-07-2018, and 12-05-2022. Subtle changes happen over time, such as the redoing of the roof on the left, the building of a fence on the right, and the felling of the tree in the middle.
  • Figure 3: Overview of the model architecture. Image triplets are passed through a Siamese backbone initialized with DINOv2 weights to compute the $cls$ tokens. The image dates are used to compute the margin $\alpha$. The $cls$ tokens and the margin $\alpha$ are then used to compute the adaptive triplet loss.
  • Figure 4: An example of cut-and-flip data augmentation.
  • Figure 5: Examples of cluster and image pairs from AMS-Trees and AMS-Buildings.
  • ...and 2 more figures
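Figure 4 illustrates cut-and-flip augmentation, which the TL;DR says helps handle panoramic imagery and noise. This summary does not describe the exact procedure, so the sketch below assumes one plausible reading for equirectangular panoramas: cut at a random column and swap the two pieces (a circular shift, equivalent to rotating the 360-degree view), then mirror horizontally with probability 0.5. The panorama is represented as a list of pixel rows, and `cut_and_flip` is a hypothetical name.

```python
import random


def cut_and_flip(panorama, rng=random):
    """Hypothetical cut-and-flip for an equirectangular panorama stored as a
    list of pixel rows (all rows the same width)."""
    width = len(panorama[0])
    cut = rng.randrange(width)  # one cut column shared by every row
    # Swapping the pieces left and right of the cut is a circular shift,
    # i.e. a horizontal rotation of the 360-degree view.
    shifted = [row[cut:] + row[:cut] for row in panorama]
    # Mirror the whole image horizontally with probability 0.5.
    if rng.random() < 0.5:
        shifted = [row[::-1] for row in shifted]
    return shifted


pano = [[1, 2, 3, 4], [5, 6, 7, 8]]
augmented = cut_and_flip(pano, random.Random(0))
```

Using a single cut column for all rows keeps the augmented image spatially coherent, and passing a seeded `random.Random` makes the augmentation reproducible.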