Towards Generalizable Scene Change Detection

Jaewoo Kim, Uehwan Kim

TL;DR

The paper tackles the generalization gap in scene change detection (SCD) by proposing GeSCF, a zero-shot framework that repurposes the Segment Anything Model (SAM) for bi-temporal change detection. It introduces the GeSCD benchmark and the ChangeVPR dataset to evaluate cross-domain generalization and temporal consistency, and shows that GeSCF yields substantial gains over the state of the art on unseen domains while maintaining complete temporal consistency. The method has two stages: initial pseudo-mask generation via cross-temporal feature correlation and an adaptive, skewness-based threshold, followed by geometric-semantic mask matching, which refines object-level changes using SAM's class-agnostic masks and mask embeddings. The results indicate strong zero-shot performance and suggest that the GeSCD benchmark and the ChangeVPR dataset provide a solid foundation for robust, generalizable SCD research with practical impact in real-world scenarios.
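
As a rough illustration of the first stage, the sketch below correlates two feature maps location by location and thresholds the resulting similarity map using the skewness of its distribution. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the `(H, W, C)` feature layout, and the specific rule `mean - k * std` with `k` adjusted by skewness are all hypothetical, since the summary does not give the exact threshold function.

```python
import numpy as np


def cosine_similarity_map(feat_t0, feat_t1, eps=1e-8):
    """Per-location cosine similarity between two (H, W, C) feature maps."""
    num = np.sum(feat_t0 * feat_t1, axis=-1)
    den = np.linalg.norm(feat_t0, axis=-1) * np.linalg.norm(feat_t1, axis=-1)
    return num / (den + eps)


def skewness(values):
    """Sample skewness (third standardized moment) of all entries."""
    v = np.asarray(values, dtype=np.float64).ravel()
    centered = v - v.mean()
    std = centered.std()
    if std == 0:
        return 0.0
    return float(np.mean(centered ** 3) / std ** 3)


def adaptive_threshold(sim, base_k=1.0, skew_gain=0.5, k_min=0.5, k_max=3.0):
    """Hypothetical rule: threshold = mean - k * std, with k shifted by skewness.

    A strongly left-skewed map (a long tail of low similarities) raises k,
    so only the tail is marked as changed. The constants are illustrative.
    """
    k = np.clip(base_k - skew_gain * skewness(sim), k_min, k_max)
    return float(sim.mean() - k * sim.std())


def initial_pseudo_mask(feat_t0, feat_t1):
    """Mark locations whose cross-temporal similarity falls below the threshold."""
    sim = cosine_similarity_map(feat_t0, feat_t1)
    return sim < adaptive_threshold(sim)


# Toy usage: identical features except for a 2x2 block that is negated.
rng = np.random.default_rng(0)
f0 = rng.normal(size=(8, 8, 4))
f1 = f0.copy()
f1[2:4, 2:4] = -f1[2:4, 2:4]
mask = initial_pseudo_mask(f0, f1)
```

In this toy case only the negated block has low similarity, so it is the only region placed in the pseudo-mask. In the paper's pipeline the features would come from the SAM image encoder rather than random arrays.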

Abstract

While current state-of-the-art Scene Change Detection (SCD) approaches achieve impressive results in well-trained research data, they become unreliable under unseen environments and different temporal conditions; in-domain performance drops from 77.6% to 8.0% in a previously unseen environment and to 4.6% under a different temporal condition -- calling for generalizable SCD and benchmark. In this work, we propose the Generalizable Scene Change Detection Framework (GeSCF), which addresses unseen domain performance and temporal consistency -- to meet the growing demand for anything SCD. Our method leverages the pre-trained Segment Anything Model (SAM) in a zero-shot manner. For this, we design Initial Pseudo-mask Generation and Geometric-Semantic Mask Matching -- seamlessly turning user-guided prompt and single-image based segmentation into scene change detection for a pair of inputs without guidance. Furthermore, we define the Generalizable Scene Change Detection (GeSCD) benchmark along with novel metrics and an evaluation protocol to facilitate SCD research in generalizability. In the process, we introduce the ChangeVPR dataset, a collection of challenging image pairs with diverse environmental scenarios -- including urban, suburban, and rural settings. Extensive experiments across various datasets demonstrate that GeSCF achieves an average performance gain of 19.2% on existing SCD datasets and 30.0% on the ChangeVPR dataset, nearly doubling the prior art performance. We believe our work can lay a solid foundation for robust and generalizable SCD research.

Paper Structure (13 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Comparative results of the current state-of-the-art model and our GeSCF under various unseen environments (ChangeVPR). GeSCF outperforms with precise boundaries and edges, where the state-of-the-art model hardly captures changes.
  • Figure 2: Illustration of the proposed GeSCF pipeline. The GeSCF pipeline consists of two major steps: (1) initial pseudo-mask generation and (2) geometric-semantic mask matching. First, we intercept facet features from the SAM image encoder and correlate them to obtain multi-head similarity maps, which are then converted into pseudo-masks using an adaptive threshold function. Next, SAM's class-agnostic masks and last image embeddings are utilized to refine these pseudo-masks based on both geometric and semantic information.
  • Figure 3: Visualization of the similarity map depending on the facets and the layers. We use key facets from the intermediate layer of the SAM ViT image encoder, which is highlighted with a red bounding box.
  • Figure 4: Illustration of the adaptive thresholding process. We dynamically adjust the threshold based on the skewness of the distribution to generate the initial pseudo-masks.
  • Figure 5: Qualitative results on VL-CMU-CD dataset with F1-scores. Our model generates reasonable change masks and does not display annotation bias.
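
The second step described in Figure 2, geometric-semantic mask matching, can be sketched roughly as follows. This is a simplified illustration under assumptions rather than the paper's algorithm: it assumes boolean class-agnostic masks are already available (for example from a segmentation model), that per-location embeddings for both images share an `(H, W, C)` layout, and that a mask counts as changed when it overlaps the pseudo-mask enough (geometric check) and its mask-averaged embeddings differ between the two times (semantic check). The function name and both thresholds are hypothetical.

```python
import numpy as np


def refine_with_masks(pseudo_mask, seg_masks, emb_t0, emb_t1,
                      overlap_ratio=0.5, sim_threshold=0.7, eps=1e-8):
    """Keep class-agnostic masks that pass a geometric and a semantic check.

    pseudo_mask : (H, W) bool array from the first stage
    seg_masks   : iterable of (H, W) bool arrays (class-agnostic masks)
    emb_t0/t1   : (H, W, C) embeddings of the two images
    Both thresholds are illustrative assumptions.
    """
    refined = np.zeros_like(pseudo_mask, dtype=bool)
    for m in seg_masks:
        area = m.sum()
        if area == 0:
            continue
        # Geometric check: fraction of the mask covered by the pseudo-mask.
        overlap = np.logical_and(m, pseudo_mask).sum() / area
        if overlap < overlap_ratio:
            continue
        # Semantic check: cosine similarity of mask-averaged embeddings.
        e0 = emb_t0[m].mean(axis=0)
        e1 = emb_t1[m].mean(axis=0)
        cos = float(e0 @ e1 / (np.linalg.norm(e0) * np.linalg.norm(e1) + eps))
        if cos < sim_threshold:
            refined |= m
    return refined


# Toy usage: region A changes between the two times, region B does not.
H, W = 6, 6
emb0 = np.zeros((H, W, 3))
emb0[..., 0] = 1.0
emb1 = emb0.copy()
emb1[0:3, 0:3] = [0.0, 1.0, 0.0]

mask_a = np.zeros((H, W), dtype=bool)
mask_a[0:3, 0:3] = True
mask_b = np.zeros((H, W), dtype=bool)
mask_b[3:6, 3:6] = True

pseudo = np.zeros((H, W), dtype=bool)
pseudo[0:3, 0:2] = True   # covers part of A
pseudo[3:6, 3:6] = True   # false positive over B

refined = refine_with_masks(pseudo, [mask_a, mask_b], emb0, emb1)
```

Here the refined output snaps to the full extent of region A, while region B is dropped because its embeddings match across time despite appearing in the pseudo-mask.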