DiffRegCD: Integrated Registration and Change Detection with Diffusion Features

Seyedehanita Madani, Rama Chellappa, Vishal M. Patel

TL;DR

This work tackles change detection under misalignment by unifying dense image registration and CD within a single model. DiffRegCD leverages diffusion-pretrained, multi-scale features and reformulates correspondence as Gaussian-smoothed classification via a flow transformer, coupled with a hierarchical, flow-guided change head. A curriculum-based training strategy stabilizes joint optimization and yields sub-pixel registration alongside accurate change maps. Across aerial and street-view benchmarks, the method achieves state-of-the-art results and demonstrates strong robustness to long temporal gaps and viewpoint variations. This framework sets a new foundation for diffusion-feature–driven, jointly optimized registration and change detection in diverse real-world scenarios.

Abstract

Change detection (CD) is fundamental to computer vision and remote sensing, supporting applications in environmental monitoring, disaster response, and urban development. Most CD models assume co-registered inputs, yet real-world imagery often exhibits parallax, viewpoint shifts, and long temporal gaps that cause severe misalignment. Traditional two-stage methods that first register and then detect, as well as recent joint frameworks (e.g., BiFA, ChangeRD), still struggle under large displacements because they rely on regression-only flow, global homographies, or synthetic perturbations. We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. DiffRegCD reformulates correspondence estimation as a Gaussian-smoothed classification task, achieving sub-pixel accuracy and stable training. It leverages frozen multi-scale features from a pretrained denoising diffusion model, which provides robustness to illumination and viewpoint variation. Supervision is provided through controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo-labels. Extensive experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground-level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation, establishing diffusion features and classification-based correspondence as a strong foundation for unified change detection.
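
To make the idea of Gaussian-smoothed classification concrete, the following is a minimal 1-D sketch, not the authors' implementation. It assumes a hypothetical set of integer displacement bins, a hypothetical smoothing width `sigma`, and an expectation-based decoding step. None of these details are specified in the abstract, and all function names are illustrative.

```python
import math

def gaussian_soft_target(true_disp, bins, sigma=1.0):
    """Soft label over discrete displacement bins: a normalized Gaussian
    centered on the (possibly sub-pixel) ground-truth displacement.
    The bin layout and sigma are assumptions made for illustration."""
    weights = [math.exp(-0.5 * ((b - true_disp) / sigma) ** 2) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    """Cross-entropy between a soft target and the predicted distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def expected_displacement(probs, bins):
    """Decode a continuous displacement as the probability-weighted mean
    of the bin centers, which is what permits sub-pixel estimates."""
    return sum(p * b for p, b in zip(probs, bins))

# Toy example: candidate displacements of -8..8 px, true shift of 2.3 px.
bins = list(range(-8, 9))
target = gaussian_soft_target(2.3, bins, sigma=1.0)
decoded = expected_displacement(target, bins)
```

Under these assumptions, the soft target spreads probability over neighbouring bins instead of forcing a one-hot label, and the weighted mean of the bins recovers a displacement that lies between bin centers. A 2-D variant would place the Gaussian over a grid of candidate offsets.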

Paper Structure

This paper contains 19 sections, 17 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Pipeline of our proposed framework. (I) A DDPM-based encoder extracts multi-scale, multi-timestep features from the bi-temporal inputs $I_A$ and $I_B$. (II) A registration module, consisting of a Gaussian Prior and a Flow Transformer Decoder, aligns features with coordinate embeddings to estimate dense flows. (III) A hierarchical change decoder fuses the warped multi-scale features across resolutions to predict the final change map $P_{cd}$. This design provides robustness to misregistration while leveraging diffusion-based features for accurate change detection. The overall visualization style is inspired by the pipeline illustration in the DDPM-CD model.
  • Figure 2: Cross-dataset qualitative results under induced misregistration. Each mini-panel follows the same layout: the left side shows the inputs $I_t$ and $I_{t+1}$; on the right, the top row shows the ground-truth flow (color wheel) and change mask, and the bottom row shows our predicted flow and change map. We display diverse aerial and street-level scenes from LEVIR-CD, WHU-CD, DSIFN-CD, SYSU-CD, and VL-CMU-CD. Patch size $256 \times 256$; Hard level: $\Delta x, \Delta y \in [-25, 25]$ px, $\theta \in [-30^\circ, 30^\circ]$, $s \in [0.80, 1.25]$. Beyond alignment, our method yields crisp boundaries, fewer background false alarms, and better recovery of small structures across datasets. White denotes change; threshold 0.5; no post-processing.
  • Figure 3: Qualitative comparison under induced misalignment. Dataset: LEVIR-CD, patch size $256 \times 256$. Hard level: $\Delta x, \Delta y \in [-25, 25]$ px, $\theta \in [-30^\circ, 30^\circ]$, $s \in [0.80, 1.25]$. Columns show the inputs $I_t$ and $I_{t+1}$, predictions from BiT, ChangeRD, BiFA, DDPM-CD, RoMa$\to$CD, ablations of our method, our full model, and the ground-truth mask (white = change). All methods use the same threshold (0.5) and no post-processing.
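
The Hard perturbation level quoted in the captions of Figures 2 and 3 can be illustrated with a short sketch of how a sampled affine transform yields a dense ground-truth flow. This is an illustration under stated assumptions, not the paper's data pipeline. The captions give only the parameter ranges, so the uniform sampling, rotation and scaling about the patch center, and the flow convention (displacement from a source pixel to its transformed position) are all assumed here, and the function names are hypothetical.

```python
import math
import random

def sample_hard_params(rng):
    """Draw one perturbation from the Hard ranges given in the captions.
    Uniform sampling is an assumption; the captions state ranges only."""
    return {
        "tx": rng.uniform(-25.0, 25.0),         # horizontal shift in px
        "ty": rng.uniform(-25.0, 25.0),         # vertical shift in px
        "theta_deg": rng.uniform(-30.0, 30.0),  # rotation in degrees
        "scale": rng.uniform(0.80, 1.25),       # isotropic scale factor
    }

def build_affine(tx, ty, theta_deg, scale, size=256):
    """2x3 affine matrix that rotates and scales about the patch center,
    then translates. Centering on the patch is an assumption."""
    c = (size - 1) / 2.0
    theta = math.radians(theta_deg)
    a, b = scale * math.cos(theta), -scale * math.sin(theta)
    d, e = scale * math.sin(theta), scale * math.cos(theta)
    return [
        [a, b, c - a * c - b * c + tx],
        [d, e, c - d * c - e * c + ty],
    ]

def flow_at(matrix, x, y):
    """Ground-truth flow at pixel (x, y): transformed position minus
    original position."""
    xp = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yp = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return xp - x, yp - y

# Sample one Hard perturbation and read off the flow at a corner pixel.
params = sample_hard_params(random.Random(0))
matrix = build_affine(**params)
corner_flow = flow_at(matrix, 0, 0)
```

Because the transform is known exactly, the flow at every pixel follows from the matrix alone, which matches the abstract's statement that the perturbations yield ground truth for flow without pseudo-labels. Under this parameterization the flow at the patch center equals the sampled translation, while pixels far from the center can move considerably more once rotation and scale are applied.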