Hierarchical Attention Diffusion Networks with Object Priors for Video Change Detection

Andrew Kiruluta; Eric Lundy; Andreas Lemos

TL;DR

This work tackles remote sensing change detection by unifying object-level priors, diffusion-based generative refinement, and per-pixel semantic categorization. It uses Mask R-CNN to isolate temporally novel objects, then applies a diffusion model whose hierarchical cross-attention integrates object-level and global context to refine change maps. A lightweight 1×1 convolutional head performs multi-class categorization, and SSIM-based perceptual refinement aligns outputs with human perception. Across synthetic and real benchmarks, the method reports state-of-the-art F1 and IoU, and the multi-class variant adds per-pixel semantic detail while maintaining detection quality. The resulting change maps are meant to be robust and interpretable for urban monitoring, disaster assessment, and environmental management.

Abstract

We present a unified change detection pipeline that combines instance-level masking, multi-scale attention within a denoising diffusion model, and per-pixel semantic classification, all refined via SSIM to match human perception. The method first isolates only temporally novel objects with Mask R-CNN, then guides diffusion updates through hierarchical cross-attention to object and global contexts, and finally categorizes each pixel into one of C change types, delivering detailed, interpretable multi-class maps. It outperforms traditional differencing, Siamese CNNs, and GAN-based detectors by 10 to 25 points in F1 and IoU on both synthetic and real-world benchmarks, marking a new state of the art in remote sensing change detection.
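The object-filtering step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes detections arrive as `(box, class, score)` tuples with integer `(x1, y1, x2, y2)` boxes, treats a detection as temporally novel when no same-class box in the other image overlaps it above an IoU threshold, and clips the mask to binary values. The function names and the default `tau_iou=0.5` are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def unique_detections(dets, other, tau_iou=0.5):
    """Keep detections with no same-class match above tau_iou in the other image."""
    return [(b, c, s) for (b, c, s) in dets
            if not any(c == c2 and iou(b, b2) > tau_iou for (b2, c2, _) in other)]

def box_mask(dets, height, width):
    """Mask that is 1 inside any retained box (assumption: boxes, clipped to binary)."""
    mask = np.zeros((height, width), dtype=np.float32)
    for (x1, y1, x2, y2), _, _ in dets:
        mask[y1:y2, x1:x2] = 1.0
    return mask

def masked_difference(img1, img2, m1, m2):
    """Initial difference |M1 * I1 - M2 * I2|, with masks broadcast over channels."""
    return np.abs(m1[..., None] * img1 - m2[..., None] * img2)

# Toy example: the car appears in both images, the building only in the first.
d1 = [((0, 0, 4, 4), "car", 0.90), ((10, 10, 14, 14), "building", 0.80)]
d2 = [((0, 0, 4, 4), "car", 0.95)]
u1 = unique_detections(d1, d2)
u2 = unique_detections(d2, d1)
m1 = box_mask(u1, 16, 16)
m2 = box_mask(u2, 16, 16)
img1 = np.ones((16, 16, 3), dtype=np.float32)
img2 = np.ones((16, 16, 3), dtype=np.float32)
delta0 = masked_difference(img1, img2, m1, m2)
```

In the toy example only the building survives the filter, so the initial difference is nonzero only inside its box.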

Paper Structure (16 sections, 31 equations, 2 figures, 2 tables)


Introduction
Background and Related Work
Traditional Change Detection Methods
Learning‐Based Change Detection
Generative Models for Change Detection
Object Detection in Change Detection
Attention Mechanisms in Generative Refinement
Methodology
Object Detection and Filtering
Diffusion with Hierarchical Attention
Multi‐Class Change Categorization
SSIM‐Based Refinement
Unified Loss and Novelty
Experimental Setup
Conclusion
...and 1 more sections

Figures (2)

Figure 1: Overview of the proposed four‐stage change detection pipeline.
Stage 1 (Object Detection & Mask Generation): Given bi‐temporal images $I_{1},I_{2}\in\mathbb{R}^{H\times W\times3}$, a Mask R-CNN detector produces detections $D_{k}=\{(b_{i}^{k},c_{i}^{k},s_{i}^{k})\}_{i=1}^{N_{k}}$. A detection $(b,c,s)$ in one image is matched to a detection $(b',c',s')$ in the other when $\mathrm{IoU}(b,b')=\frac{\mathrm{area}(b\cap b')}{\mathrm{area}(b\cup b')}>\tau_{\mathrm{IoU}}$ and $c=c'$; the detections left unmatched form the unique set $D_{k}^{\mathrm{uniq}}$, yielding binary masks $M_{k}(x,y)=\sum_{(b,c,s)\in D_{k}^{\mathrm{uniq}}}\mathbf{1}_{(x,y)\in b}$.
Stage 2 (Hierarchical Attention Diffusion): We form the initial difference $\Delta_{0}=\bigl|M_{1}\odot I_{1}-M_{2}\odot I_{2}\bigr|$ and add noise $x_{T}=\Delta_{0}+\epsilon_{T},\ \epsilon_{T}\sim\mathcal{N}(0,\sigma^{2}I)$. At each reverse step $t$, query embeddings $Q_{t}=W_{Q}\,\mathrm{Flatten}(x_{t})$ attend to multi‐scale keys $K_{t}^{(s)}=W_{K}^{(s)}F^{(s)}$ and values $V_{t}^{(s)}=W_{V}^{(s)}F^{(s)}$, producing attention outputs $\mathrm{Attn}_{t}^{(s)}=\mathrm{softmax}\!\bigl(Q_{t}K_{t}^{(s)\top}/\sqrt{d_{k}}\bigr)\,V_{t}^{(s)}$. These are concatenated and fused as $\mathrm{Attn}_{t}^{\mathrm{hier}}=W_{O}\bigl[\mathrm{Attn}_{t}^{(1)}\Vert\mathrm{Attn}_{t}^{(2)}\Vert\mathrm{Attn}_{t}^{(\mathrm{glob})}\bigr]$, and the denoising update is $\hat{\epsilon}_{t}=\epsilon_{\theta}(x_{t},t)+\mathrm{Attn}_{t}^{\mathrm{hier}}$, followed by $x_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\bigl(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\hat{\epsilon}_{t}\bigr)+\sigma_{t}z_{t}$.
Stage 3 (Multi‐Class Change Categorization): The refined map $\Delta^{*}=x_{0}$ is fed through a $1\times1$ convolution and softmax, giving $S_{ijc}=\exp(u_{ijc})/\sum_{c'}\exp(u_{ijc'})$ with $u=\mathrm{Conv}_{1\times1}(\Delta^{*})$.
Stage 4 (SSIM‐Based Perceptual Refinement): For each class channel $c$, compute local SSIM as $\mathrm{SSIM}_{c}(i,j)=\frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})}$ and fuse via $S^{\mathrm{ref}}_{ijc}=\lambda\,S_{ijc}+(1-\lambda)\bigl(1-\mathrm{SSIM}_{c}(i,j)\bigr)$ to produce the final change map.
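The Stage 2 update in the caption can be illustrated with a small NumPy sketch. It is an assumption-laden toy, not the paper's implementation: tokens are stored as rows (so projections are written `X @ W` rather than `W X`), the learned denoiser $\epsilon_{\theta}$ is replaced by a zero placeholder, all weights are random, and the names `cross_attention`, `hierarchical_attention`, and `reverse_step` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, feats, w_k, w_v):
    """softmax(Q K^T / sqrt(d_k)) V for one context scale."""
    k = feats @ w_k                            # (M, d_k)
    v = feats @ w_v                            # (M, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (N, M)
    return softmax(scores, axis=-1) @ v        # (N, d_v)

def hierarchical_attention(x_tokens, contexts, w_q, w_ks, w_vs, w_o):
    """Attend to every context scale, concatenate the outputs, fuse with W_O."""
    q = x_tokens @ w_q
    outs = [cross_attention(q, f, wk, wv)
            for f, wk, wv in zip(contexts, w_ks, w_vs)]
    return np.concatenate(outs, axis=-1) @ w_o

def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, rng):
    """One reverse diffusion update x_t -> x_{t-1} given the noise estimate."""
    z = rng.standard_normal(x_t.shape)
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    return mean + sigma_t * z

rng = np.random.default_rng(0)
n_tokens, d, d_f, d_k, d_v = 16, 3, 5, 8, 4
x_t = rng.standard_normal((n_tokens, d))
# Three context scales: two object-level feature sets and one global token.
contexts = [rng.standard_normal((10, d_f)),
            rng.standard_normal((6, d_f)),
            rng.standard_normal((1, d_f))]
w_q = rng.standard_normal((d, d_k))
w_ks = [rng.standard_normal((d_f, d_k)) for _ in contexts]
w_vs = [rng.standard_normal((d_f, d_v)) for _ in contexts]
w_o = rng.standard_normal((len(contexts) * d_v, d))

attn = hierarchical_attention(x_t, contexts, w_q, w_ks, w_vs, w_o)
eps_theta = np.zeros_like(x_t)   # placeholder for the learned denoiser output
eps_hat = eps_theta + attn
x_prev = reverse_step(x_t, eps_hat, alpha_t=0.98, alpha_bar_t=0.5, sigma_t=0.0, rng=rng)
```

With a single global token, the softmax has only one key to attend to, so that branch returns the same value vector for every query; the object-level branches are where the attention weights actually vary.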
Figure 2: Change detection example. (a) The baseline image $I_{1}$ at time $t_{1}$. (b) The follow-up image $I_{2}$ at time $t_{2}$, showing added and removed structures. (c) The final change map $\Delta^{\mathrm{ref}}$, obtained by first computing the masked difference $\Delta_{0}=|M_{1}\odot I_{1} - M_{2}\odot I_{2}|$, then applying the attention‐augmented reverse diffusion to yield $\Delta^{*}=x_{0}$ with hierarchical multi‐scale attention, followed by multi‐class softmax classification and SSIM‐based fusion. Darker regions in (c) indicate higher confidence of change, accurately highlighting both appearance and disappearance of objects.
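The last two stages, per-pixel softmax classification and SSIM-based fusion, can be sketched as follows. The captions do not say which two signals $x$ and $y$ the SSIM compares, so this sketch leaves them as inputs and computes SSIM over a single window. The constants `c1 = 0.01**2` and `c2 = 0.03**2` are the conventional defaults for data scaled to [0, 1] and are an assumption here, as are the function names and `lam=0.5`.

```python
import numpy as np

def conv1x1_softmax(delta, weight, bias):
    """1x1 convolution as a per-pixel linear map, followed by softmax over classes.

    delta: (H, W, D) refined map, weight: (D, C), bias: (C,) -> S: (H, W, C).
    """
    u = delta @ weight + bias
    u = u - u.max(axis=-1, keepdims=True)   # stabilise the exponentials
    e = np.exp(u)
    return e / e.sum(axis=-1, keepdims=True)

def ssim_window(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two equally sized windows, using population statistics."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def fuse(s, ssim, lam=0.5):
    """S_ref = lam * S + (1 - lam) * (1 - SSIM)."""
    return lam * s + (1 - lam) * (1 - ssim)

rng = np.random.default_rng(0)
delta_star = rng.random((4, 4, 3))              # stand-in for the refined map x_0
weight = rng.standard_normal((3, 2))            # C = 2 change classes (toy choice)
bias = np.zeros(2)
s = conv1x1_softmax(delta_star, weight, bias)

patch = rng.random((7, 7))
ssim_same = ssim_window(patch, patch)           # identical windows
s_ref = fuse(s, np.ones_like(s), lam=0.5)       # SSIM = 1 everywhere
```

Note that the fused scores are no longer guaranteed to sum to one across classes, so a final argmax or renormalisation would be needed before reading them as probabilities.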
