Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms

Chun-Jung Lin, Sourav Garg, Tat-Jun Chin, Feras Dayoub

TL;DR

The paper tackles scene change detection under challenging photometric and geometric variations, proposing a robust framework that freezes a DINOv2 visual foundation backbone and uses full-image cross-attention to learn reliable correspondences between image pairs. Dense features from the two time steps are registered via cross-attention and fused to predict a change mask, using a lightweight decoder and a weighted cross-entropy loss to handle class imbalance. Experiments on VL-CMU-CD and PSCD, including unaligned and viewpoint-augmented variants, show superior F1-scores and strong generalization, and ablations confirm the effectiveness of the cross-attention comparator and the architectural choices. The results indicate strong potential for real-world deployment in autonomous driving, urban monitoring, and surveillance, where robust change detection must tolerate viewpoint and lighting variations and adapt to unseen environments.
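The registration and loss steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses plain scaled dot-product attention without the learned projections or multiple heads a real attention layer would typically have, and the function names, toy features, and class weights are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(f0, f1):
    """Each t0 token (query) attends over every t1 token (key and value).

    f0 and f1 are lists of d-dimensional feature vectors. The output has one
    vector per t0 token: a weighted average of the t1 features, i.e. the t1
    features "registered" to the t0 token positions.
    Assumption: no learned projections, single head.
    """
    dim = len(f0[0])
    scale = 1.0 / math.sqrt(dim)
    registered = []
    for query in f0:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in f1]
        weights = softmax(scores)
        registered.append(
            [sum(w * value[j] for w, value in zip(weights, f1)) for j in range(dim)]
        )
    return registered


def weighted_cross_entropy(probs, labels, class_weights):
    """Per-pixel cross-entropy where each class has its own weight.

    probs[i][c] is the predicted probability of class c at pixel i.
    The sum is normalized by the total weight of the target labels.
    """
    total, norm = 0.0, 0.0
    for pixel_probs, label in zip(probs, labels):
        weight = class_weights[label]
        total -= weight * math.log(pixel_probs[label])
        norm += weight
    return total / norm


# Toy example: the t1 tokens hold the same content as t0 but in another order.
f0 = [[10.0, 0.0], [0.0, 10.0]]
f1 = [[0.0, 10.0], [10.0, 0.0], [0.0, 0.0]]
registered = cross_attention(f0, f1)

# Hypothetical weights: the rarer "changed" class (label 1) counts four times more.
loss = weighted_cross_entropy([[0.5, 0.5], [0.5, 0.5]], [0, 1], [1.0, 4.0])
```

In the toy example every t0 token attends over all t1 tokens rather than a local window, which is the sense in which the attention is "full-image": matching content is found even when it sits at a different position in the other image.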

Abstract

We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundation model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences. In order to effectively learn correspondences and mis-correspondences between an image pair for the change detection task, we propose to a) "freeze" the backbone in order to retain the generality of dense foundation features, and b) employ "full-image" cross-attention to better tackle the viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The results indicate our method's superior generalization capabilities over existing state-of-the-art approaches, showing robustness against photometric and geometric variations as well as better overall generalization when fine-tuned to adapt to new environments. Detailed ablation studies further validate the contributions of each component in our architecture. Our source code is available at: https://github.com/ChadLin9596/Robust-Scene-Change-Detection.
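For reference, the F1-score used as the evaluation metric is the harmonic mean of precision and recall over the "changed" class. The sketch below computes it for a flattened binary mask; the function name and toy masks are hypothetical, and how the paper aggregates scores (per image or over the whole test set) is not stated in this summary.

```python
def f1_score(pred, target):
    """F1 of the positive ("changed") class for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = f1_score(pred_mask, true_mask)
```

Here precision and recall are both 2/3, so the F1-score is also 2/3.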

Paper Structure (23 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Change detection on unaligned images: we approach the change detection problem with a cross-attention module, enabling robust detection on unaligned scenes.
  • Figure 2: Architecture: An overview of the proposed change detection architecture, where the backbone is kept frozen to achieve better overall generalization. $F_0$ and $F_1$ are the dense features from the $t_0$ and $t_1$ images, respectively.
  • Figure 3: F1-score under Affine Transformation: we evaluate the F1-score after translating (trans.) or rotating (rot.) the $t_0$ images from the VL-CMU-CD test set. (a) and (b) are translation results without (wo.) and with (w.) different viewpoint augmentation (diff-view augment). (c) and (d) are rotation results before and after the augmentation. The blue line indicates our results. C3PO, DR-TANet, and CDNet results are plotted as orange, green, and red lines, respectively.
  • Figure 4: Qualitative Results: we visualize results from "Aligned" of VL-CMU-CD in rows 2 and 5. The other rows are from "Diff-2". The first scene compares the same $t_0$ image with a sequence of $t_1$ images, while the other compares the opposite.
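
The translation test described for Figure 3 amounts to shifting the $t_0$ image before comparing it with $t_1$. A minimal sketch of such a shift is given below, assuming a horizontal pixel shift with constant fill; the function name, the fill value, and the shift amounts are hypothetical, and the paper's exact transformation settings are not given in this summary.

```python
def translate(image, dx, fill=0):
    """Shift a 2D image (list of rows) right by dx pixels, or left if dx < 0.

    Pixels shifted out of the frame are dropped and the vacated columns are
    filled with `fill`, so the image keeps its original width.
    """
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            new_row = [fill] * min(dx, width) + row[: max(width - dx, 0)]
        else:
            new_row = row[-dx:] + [fill] * min(-dx, width)
        shifted.append(new_row)
    return shifted


# Toy 2x3 "image" shifted one pixel to the right and to the left.
t0_image = [[1, 2, 3], [4, 5, 6]]
shifted_right = translate(t0_image, 1)
shifted_left = translate(t0_image, -1)
```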