Hierarchical Attention Diffusion Networks with Object Priors for Video Change Detection

Andrew Kiruluta, Eric Lundy, Andreas Lemos

TL;DR

This work tackles remote sensing change detection by unifying object-level priors, diffusion-based generative refinement, and per-pixel semantic categorization. It uses Mask R-CNN to isolate temporally novel objects, then applies a diffusion model with hierarchical cross-attention that integrates object-level and global contexts to refine change maps. A lightweight 1×1 head performs multi-class categorization, and SSIM-based perceptual refinement aligns outputs with human perception. Across synthetic and real benchmarks, the method achieves state-of-the-art F1 and IoU, with the multi-class variant delivering detailed semantic insights while maintaining detection quality, offering robust, interpretable change maps for applications in urban monitoring, disaster assessment, and environmental management.

Abstract

We present a unified change detection pipeline that combines instance-level masking, multi-scale attention within a denoising diffusion model, and per-pixel semantic classification, all refined via SSIM to match human perception. By first isolating only temporally novel objects with Mask R-CNN, then guiding diffusion updates through hierarchical cross-attention to object and global contexts, and finally categorizing each pixel into one of C change types, our method delivers detailed, interpretable multi-class maps. It outperforms traditional differencing, Siamese CNNs, and GAN-based detectors by 10-25 points in F1 and IoU on both synthetic and real-world benchmarks, marking a new state of the art in remote sensing change detection.
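
The first step, isolating temporally novel objects, can be illustrated with a small sketch. It is not the authors' implementation: it assumes detections are already available as (box, class, score) tuples, treats a detection as novel when no same-class detection in the other image overlaps it with IoU above a threshold, and rasterizes the surviving boxes into a binary mask. The function names and the threshold value 0.5 are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed layout
Detection = Tuple[Box, str, float]        # (box, class label, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def novel_detections(current: List[Detection], other: List[Detection],
                     tau_iou: float = 0.5) -> List[Detection]:
    """Keep detections in `current` that have no same-class match in `other`.

    tau_iou = 0.5 is an assumed value, not one taken from the paper.
    """
    return [d for d in current
            if not any(d[1] == o[1] and iou(d[0], o[0]) > tau_iou for o in other)]


def box_mask(dets: List[Detection], height: int, width: int) -> List[List[int]]:
    """Rasterize the boxes of `dets` into a binary mask (1 inside any box)."""
    mask = [[0] * width for _ in range(height)]
    for (x1, y1, x2, y2), _, _ in dets:
        for y in range(max(0, int(y1)), min(height, int(y2))):
            for x in range(max(0, int(x1)), min(width, int(x2))):
                mask[y][x] = 1
    return mask


# Toy example: the building appears in both images, the car only in the first.
d1 = [((0, 0, 4, 4), "building", 0.9), ((6, 6, 9, 9), "car", 0.8)]
d2 = [((0, 0, 4, 4), "building", 0.95)]
uniq1 = novel_detections(d1, d2)   # only the car survives
mask1 = box_mask(uniq1, 10, 10)
```

In this toy case the mask for the first image covers only the car, so a later differencing step would ignore the unchanged building.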

Paper Structure (16 sections, 31 equations, 2 figures, 2 tables)

This paper contains 16 sections, 31 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the proposed four-stage change detection pipeline. Stage 1 (Object Detection & Mask Generation): Given bi-temporal images $I_{1},I_{2}\in\mathbb{R}^{H\times W\times3}$, a Mask R-CNN detector produces detections $D_{k}=\{(b_{i}^{k},c_{i}^{k},s_{i}^{k})\}_{i=1}^{N_{k}}$. Detections are matched across the two images when $\mathrm{IoU}(b,b')=\frac{\mathrm{area}(b\cap b')}{\mathrm{area}(b\cup b')}>\tau_{\mathrm{IoU}}$ and $c=c'$; the unmatched (unique) detections $D_{k}^{\mathrm{uniq}}$ yield binary masks $M_{k}(x,y)=\sum_{(b,c,s)\in D_{k}^{\mathrm{uniq}}}\mathbf{1}_{(x,y)\in b}$. Stage 2 (Hierarchical Attention Diffusion): We form the initial difference $\Delta_{0}=\bigl|M_{1}\odot I_{1}-M_{2}\odot I_{2}\bigr|$ and add noise $x_{T}=\Delta_{0}+\epsilon_{T},\ \epsilon_{T}\sim\mathcal{N}(0,\sigma^{2}I)$. At each reverse step $t$, query embeddings $Q_{t}=W_{Q}\,\mathrm{Flatten}(x_{t})$ attend to multi-scale keys $K_{t}^{(s)}=W_{K}^{(s)}F^{(s)}$ and values $V_{t}^{(s)}=W_{V}^{(s)}F^{(s)}$, producing attention outputs $\mathrm{Attn}_{t}^{(s)}=\mathrm{softmax}\!\bigl(Q_{t}K_{t}^{(s)\top}/\sqrt{d_{k}}\bigr)\,V_{t}^{(s)}$. These are concatenated and fused as $\mathrm{Attn}_{t}^{\mathrm{hier}}=W_{O}\bigl[\mathrm{Attn}_{t}^{(1)}\Vert\mathrm{Attn}_{t}^{(2)}\Vert\mathrm{Attn}_{t}^{(\mathrm{glob})}\bigr]$, and the denoising update is $\hat{\epsilon}_{t}=\epsilon_{\theta}(x_{t},t)+\mathrm{Attn}_{t}^{\mathrm{hier}}$, followed by $x_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\bigl(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\hat{\epsilon}_{t}\bigr)+\sigma_{t}z_{t}$. Stage 3 (Multi-Class Change Categorization): The refined map $\Delta^{*}=x_{0}$ is fed through a $1\times1$ convolution and softmax, giving $S_{ijc}=\exp(u_{ijc})/\sum_{c'}\exp(u_{ijc'})$ with $u=\mathrm{Conv}_{1\times1}(\Delta^{*})$. Stage 4 (SSIM-Based Perceptual Refinement): For each class channel $c$, compute local SSIM as $\mathrm{SSIM}_{c}(i,j)=\frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})}$ and fuse via $S^{\mathrm{ref}}_{ijc}=\lambda\,S_{ijc}+(1-\lambda)\bigl(1-\mathrm{SSIM}_{c}(i,j)\bigr)$ to produce the final change map.
  • Figure 2: Change detection example. (a) The baseline image $I_{1}$ at time $t_{1}$. (b) The follow-up image $I_{2}$ at time $t_{2}$, showing added and removed structures. (c) The final change map $\Delta^{\mathrm{ref}}$, obtained by first computing the masked difference $\Delta_{0}=|M_{1}\odot I_{1} - M_{2}\odot I_{2}|$, then applying the attention-augmented reverse diffusion to yield $\Delta^{*}=x_{0}$ with hierarchical multi-scale attention, followed by multi-class softmax classification and SSIM-based fusion. Darker regions in (c) indicate higher confidence of change, accurately highlighting both appearance and disappearance of objects.
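
The last two stages named in the captions, per-pixel softmax and SSIM-based fusion, reduce to a few lines of arithmetic, sketched here in plain Python for a single pixel and a single local window. This is an illustration under assumptions, not the paper's code. The captions do not say which two signals the patches $x$ and $y$ are drawn from, so the sketch takes them as given inputs. The constants $C_{1}=10^{-4}$ and $C_{2}=9\times10^{-4}$ are the common SSIM defaults for intensities in $[0,1]$, and $\lambda=0.5$ is an arbitrary illustrative value.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the class logits of one pixel."""
    m = max(logits)
    exps = [math.exp(u - m) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ssim(x: List[float], y: List[float],
         c1: float = 1e-4, c2: float = 9e-4) -> float:
    """SSIM of two equal-length flattened patches with values in [0, 1]."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def fuse(s: float, ssim_value: float, lam: float = 0.5) -> float:
    """S_ref = lam * S + (1 - lam) * (1 - SSIM); lam = 0.5 is an assumption."""
    return lam * s + (1 - lam) * (1 - ssim_value)


# One pixel with three hypothetical change classes.
probs = softmax([2.0, 0.0, 0.0])

# Identical patches are structurally the same, so SSIM is 1 and the
# fused score is pulled down toward lam * S.
patch = [0.1, 0.4, 0.4, 0.9]
same = ssim(patch, patch)
fused_same = fuse(probs[0], same)

# A dissimilar patch lowers SSIM, which raises the fused change score.
other = [0.9, 0.4, 0.1, 0.1]
fused_diff = fuse(probs[0], ssim(patch, other))
```

The fusion rule rewards pixels where the classifier is confident and the local structure differs: with identical patches the second term vanishes, while structurally different patches push the score up.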