ZeroSCD: Zero-Shot Street Scene Change Detection

Shyam Sundar Kannan, Byung-Cheol Min

TL;DR

ZeroSCD addresses the challenge of change detection without training data by leveraging pre-trained Visual Place Recognition features from PlaceFormer and class-agnostic segmentation from SAM to detect and localize scene changes between two images. The method computes patch-level correspondences, estimates a homography, derives a coarse change map, and refines boundaries with segmentation, all in a zero-shot, training-free pipeline. It achieves state-of-the-art or competitive performance on VL-CMU-CD and PCD2015 benchmarks without dataset-specific training, demonstrating robustness to style variations and generalization across urban environments. This approach offers a scalable, practical solution for autonomous map updates, with potential extensions to other domains and unified foundational models to reduce computational overhead.
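
The patch-level correspondence step can be illustrated with a minimal sketch. It assumes the patch embeddings of each image are already available as a NumPy array of shape (num_patches, dim), and it uses mutual nearest-neighbour matching under cosine similarity, which is a common choice but not necessarily the exact rule used in ZeroSCD. The function name `mutual_nearest_matches` is hypothetical. The resulting index pairs are the kind of input a robust homography estimator (for example, a RANSAC-based one) would consume.

```python
import numpy as np


def mutual_nearest_matches(feats_a, feats_b, eps=1e-12):
    """Return (i, j) index pairs where patch i of image A and patch j of
    image B are each other's nearest neighbour under cosine similarity.

    feats_a: array of shape (Na, D), feats_b: array of shape (Nb, D).
    Hypothetical helper for illustration; not the authors' implementation.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)

    # L2-normalise rows so that the dot product equals cosine similarity.
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + eps)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + eps)
    sim = a @ b.T  # shape (Na, Nb)

    best_b_for_a = sim.argmax(axis=1)  # nearest B patch for every A patch
    best_a_for_b = sim.argmax(axis=0)  # nearest A patch for every B patch

    # Keep a pair only if the choice is mutual.
    idx_a = np.arange(sim.shape[0])
    keep = best_a_for_b[best_b_for_a] == idx_a
    return np.stack([idx_a[keep], best_b_for_a[keep]], axis=1)


# Toy usage: image B holds the same three descriptors in a different order.
feats_a = np.eye(3)
feats_b = np.eye(3)[[1, 0, 2]]
matches = mutual_nearest_matches(feats_a, feats_b)
```

Patches that find no mutual partner are simply dropped here; in a change detection setting, such unmatched or poorly matched patches are natural candidates for the coarse change map.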

Abstract

Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time-consuming process. To overcome this, we propose ZeroSCD, a zero-shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre-existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. These are then combined with segmentation results from the semantic segmentation model to precisely delineate the boundaries of the detected changes. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state-of-the-art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, proving its effectiveness and adaptability across different scenarios.

Paper Structure (19 sections, 3 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Two images of the same location, taken before and after the construction of a roundabout, are shown. The areas where changes have occurred are highlighted with a red box. Detecting these changes related to the roundabout and signs is essential to ensure the map is updated for the safe navigation of autonomous vehicles.
  • Figure 2: Architecture of the ZeroSCD framework. The input images are passed through the image encoder to extract patch embeddings. The homography between the two images is then computed from the patch correspondences. Using the alignment given by the homography, a coarse difference map is computed that identifies the patches where changes have occurred. This map is then compared with the segmentation output of SAM, and the SAM segments that align with the coarse difference map are selected. The union of all segments corresponding to changed regions yields the final binary change mask.
  • Figure 3: Binary change masks generated by our method on various image pairs from the VL-CMU-CD dataset, along with the input images and the ground truth.
  • Figure 4: Binary change mask generated by our method for an image pair from the Tsunami dataset along with the input images and the ground truth.
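
The refinement stage described in the Figure 2 caption can be sketched as follows. The sketch assumes the coarse difference map is a boolean grid with one entry per patch and that the SAM output is available as a list of boolean pixel masks. The overlap criterion and the `min_overlap` threshold are assumptions made for illustration, since the caption does not state how alignment between a segment and the coarse map is measured. Both function names are hypothetical.

```python
import numpy as np


def upsample_patch_map(patch_map, patch_size):
    """Expand a per-patch boolean map to pixel resolution by repeating
    each entry over a patch_size x patch_size block."""
    patch_map = np.asarray(patch_map, dtype=bool)
    return np.repeat(np.repeat(patch_map, patch_size, axis=0), patch_size, axis=1)


def refine_with_segments(coarse, segments, min_overlap=0.5):
    """Select segments that align with the coarse change map and return
    their union as the final binary change mask.

    A segment is kept when at least `min_overlap` of its area lies inside
    the coarse map. This criterion is an assumption, not the paper's rule.
    """
    coarse = np.asarray(coarse, dtype=bool)
    final = np.zeros_like(coarse)
    for seg in segments:
        seg = np.asarray(seg, dtype=bool)
        area = seg.sum()
        if area == 0:
            continue  # skip empty masks
        overlap = np.logical_and(seg, coarse).sum() / area
        if overlap >= min_overlap:
            final |= seg  # union of the selected segments
    return final


# Toy usage: a 2x2 patch grid with one changed patch, 2-pixel patches.
coarse = upsample_patch_map([[True, False], [False, False]], patch_size=2)

seg_changed = np.zeros((4, 4), dtype=bool)
seg_changed[0:2, 0:3] = True   # mostly inside the changed patch
seg_static = np.zeros((4, 4), dtype=bool)
seg_static[2:4, :] = True      # entirely outside the changed patch

mask = refine_with_segments(coarse, [seg_changed, seg_static])
```

In this toy case the first segment is kept and the second is discarded, so the final mask follows the segment boundary rather than the blocky patch grid, which is the purpose of the refinement step.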