Hierarchical Attention Diffusion Networks with Object Priors for Video Change Detection
Andrew Kiruluta, Eric Lundy, Andreas Lemos
TL;DR
This work tackles remote sensing change detection by unifying object-level priors, diffusion-based generative refinement, and per-pixel semantic categorization. It uses Mask R-CNN to isolate temporally novel objects, then applies a diffusion model with hierarchical cross-attention that integrates object-level and global contexts to refine change maps. A lightweight 1×1 head performs multi-class categorization, and SSIM-based perceptual refinement aligns outputs with human perception. Across synthetic and real benchmarks, the method achieves state-of-the-art F1 and IoU, with the multi-class variant delivering detailed semantic insights while maintaining detection quality, offering robust, interpretable change maps for applications in urban monitoring, disaster assessment, and environmental management.
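To make the "lightweight 1×1 head" concrete, here is a minimal NumPy sketch, assuming the head is a 1×1 convolution followed by a softmax, which amounts to applying the same linear map to the feature vector at every pixel. The function name `pointwise_head`, the feature dimension of 8, and the choice of 3 classes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pointwise_head(features, weight, bias):
    """Hypothetical 1x1 head: one shared linear map applied at every pixel.

    features: (H, W, D) feature map
    weight:   (D, C) projection matrix
    bias:     (C,) per-class bias
    Returns an (H, W) integer class map and (H, W, C) softmax probabilities.
    """
    logits = features @ weight + bias                      # (H, W, C)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax over classes
    return probs.argmax(axis=-1), probs

# Toy input: random features stand in for the refined diffusion output.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))   # H=4, W=4, D=8 (assumed sizes)
W = rng.normal(size=(8, 3))          # C=3 change classes (assumed)
b = np.zeros(3)
labels, probs = pointwise_head(feats, W, b)
```

Because the map has no spatial extent, the head adds only D×C + C parameters, which is one plausible reading of "lightweight" here.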
Abstract
We present a unified change detection pipeline that combines instance-level masking, multi-scale attention within a denoising diffusion model, and per-pixel semantic classification, with outputs refined via SSIM to match human perception. The method first isolates only temporally novel objects with Mask R-CNN, then guides diffusion updates through hierarchical cross-attention over object and global contexts, and finally assigns each pixel to one of C change types, yielding detailed, interpretable multi-class maps. It outperforms traditional differencing, Siamese CNNs, and GAN-based detectors by 10 to 25 points in F1 and IoU on both synthetic and real-world benchmarks, marking a new state of the art in remote sensing change detection.
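For readers less familiar with the two reported metrics, the following is a minimal sketch of F1 and IoU for a binary change map, using their standard definitions. The function name `f1_and_iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def f1_and_iou(pred, target):
    """F1 and IoU for binary change masks (1 = changed, 0 = unchanged)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # changed, predicted changed
    fp = np.logical_and(pred, ~target).sum()   # unchanged, predicted changed
    fn = np.logical_and(~pred, target).sum()   # changed, predicted unchanged
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    f1 = 2 * tp / f1_den if f1_den else 1.0
    iou = tp / iou_den if iou_den else 1.0
    return float(f1), float(iou)

# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
f1, iou = f1_and_iou(pred, target)   # f1 = 4/6, iou = 2/4
```

In the binary case the two metrics are tied by F1 = 2·IoU / (1 + IoU), so a gain in one implies a gain in the other, although the size of the gain in points differs.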
