ZeroSCD: Zero-Shot Street Scene Change Detection
Shyam Sundar Kannan, Byung-Cheol Min
TL;DR
ZeroSCD detects and localizes scene changes between two images without any training data. It combines features from PlaceFormer, a pre-trained Visual Place Recognition model, with class-agnostic segmentation from SAM. The method computes patch-level correspondences, estimates a homography, derives a coarse change map, and refines the change boundaries with segmentation masks, all in a zero-shot, training-free pipeline. It achieves state-of-the-art or competitive performance on the VL-CMU-CD and PCD2015 benchmarks without dataset-specific training, and it is robust to style variations and generalizes across urban environments. The approach offers a scalable, practical solution for autonomous map updates, with potential extensions to other domains and to a unified foundation model that would reduce computational overhead.
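The first three steps of the pipeline (patch correspondences, homography, coarse change map) can be illustrated with a minimal NumPy sketch. It assumes the patch descriptors have already been extracted and are given as plain arrays laid out on a regular grid. The function names, the mutual-nearest-neighbour rule, the similarity thresholds, and the plain direct linear transform without outlier rejection are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def _normalise(feat):
    # Unit-length rows so that dot products are cosine similarities.
    return feat / np.linalg.norm(feat, axis=1, keepdims=True)

def match_patches(feat_a, feat_b, min_sim=0.8):
    """Mutual nearest neighbours between two (N, D) descriptor sets.

    min_sim is a hypothetical threshold that discards weak matches.
    """
    sim = _normalise(feat_a) @ _normalise(feat_b).T   # (Na, Nb)
    best_b = sim.argmax(axis=1)      # best patch in B for each patch in A
    best_a = sim.argmax(axis=0)      # best patch in A for each patch in B
    idx_a = np.arange(sim.shape[0])
    keep = (best_a[best_b] == idx_a) & (sim[idx_a, best_b] >= min_sim)
    return idx_a[keep], best_b[keep]

def fit_homography(src, dst):
    """Direct linear transform from >= 4 (x, y) pairs, no outlier rejection."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)         # null vector of the constraint matrix
    return h / h[2, 2]

def coarse_change_map(feat_a, feat_b, grid_hw, h_mat, threshold=0.5):
    """Flag patches of A whose warped counterpart in B looks dissimilar."""
    hgt, wid = grid_hw
    a, b = _normalise(feat_a), _normalise(feat_b)
    ys, xs = np.divmod(np.arange(hgt * wid), wid)
    pts = np.stack([xs, ys, np.ones_like(xs)]).astype(float)
    proj = h_mat @ pts
    u = np.rint(proj[0] / proj[2]).astype(int)
    v = np.rint(proj[1] / proj[2]).astype(int)
    inside = (u >= 0) & (u < wid) & (v >= 0) & (v < hgt)
    changed = np.zeros(hgt * wid, dtype=bool)   # out-of-view patches stay False
    idx_b = v[inside] * wid + u[inside]
    cos = np.sum(a[inside] * b[idx_b], axis=1)
    changed[inside] = (1.0 - cos) > threshold
    return changed.reshape(hgt, wid)
```

Patch indices convert to grid coordinates as `(index % width, index // width)` before being passed to `fit_homography`. With real images, a robust estimator would normally replace the plain least-squares fit, since changed regions produce spurious matches.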
Abstract
Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data that are costly and time-consuming to collect. To overcome this, we propose ZeroSCD, a zero-shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre-existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. The detected changes are then combined with the output of the segmentation model to precisely delineate their boundaries. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state-of-the-art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, demonstrating its effectiveness and adaptability across different scenarios.
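The boundary refinement step described above can be sketched as follows, assuming the coarse change map is a boolean patch grid and the segmentation model returns a list of boolean pixel masks. The overlap rule, the `min_overlap` threshold, and the function names are hypothetical choices for illustration; the paper may combine the two signals differently.

```python
import numpy as np

def upsample_patch_map(patch_map, patch_size):
    """Expand a (h, w) patch-level map to (h * p, w * p) pixel resolution."""
    rows = np.repeat(patch_map, patch_size, axis=0)
    return np.repeat(rows, patch_size, axis=1)

def refine_with_masks(coarse, masks, min_overlap=0.5):
    """Keep every segment that is mostly covered by the coarse change map.

    coarse: (H, W) boolean pixel-level change map.
    masks:  iterable of (H, W) boolean segment masks.
    The result follows segment boundaries instead of patch boundaries.
    """
    refined = np.zeros(coarse.shape, dtype=bool)
    for mask in masks:
        area = mask.sum()
        if area == 0:
            continue                      # skip empty segments
        overlap = np.logical_and(mask, coarse).sum() / area
        if overlap >= min_overlap:
            refined |= mask               # adopt the whole segment
    return refined
```

A segment that extends slightly beyond the coarse region is kept in full, while coarse pixels not covered by any accepted segment are dropped. This is the sense in which the masks sharpen the blocky patch-level map.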
