DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing
Jinyue Zhang, Xiangrong Zhang, Zhongjian Huang, Tianyang Zhang, Yifei Jiang, Licheng Jiao
TL;DR
DiffMOD reframes moving object detection in remote sensing as a progressive diffusion-based denoising task over scattered points, enabling high-order spatial-temporal interactions. It introduces Spatial Relation Aggregation Attention (SRAA) and Temporal Propagation and Global Fusion (TPGF) to capture inter-point relationships and cross-frame memory, respectively, and couples these with a MinK OTA-based training regime and a target missing loss to mitigate point clustering. Evaluations on RsData show improved recall, precision, and temporal consistency over baselines, particularly for extremely small and noisy targets. The approach demonstrates the value of sparse point modeling and diffusion-inspired optimization for robust MOD in challenging satellite video data.
Abstract
Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD method for remote sensing. Inspired by diffusion models, we formulate network optimization as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, and aggregate global feature embeddings to compensate for the limited coverage of sparse point features. Spatial Relation Aggregation Attention models relative spatial positions and semantic affinities to enable high-order interactions among point-level features for enhanced object representation. To enhance temporal consistency, we design the Temporal Propagation and Global Fusion module, which leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the tendency of denoised points to cluster around salient objects. Experiments on the RsData remote sensing MOD dataset show that our scattered point denoising method explores potential relationships between sparse moving objects more effectively and improves both detection capability and temporal consistency.
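The progressive denoising idea described above can be illustrated with a toy sketch. It is not the DiffMOD network: the learned denoising stages are replaced by a hypothetical oracle step (`refine_step`) that moves each point part of the way toward its nearest ground-truth center. All function names, the step size, the noise scale, and the number of levels are assumptions chosen for illustration only.

```python
import math
import random


def scatter_points(centers, num_points, noise_scale, rng):
    """Sample sparse noisy points around randomly chosen object centers."""
    points = []
    for _ in range(num_points):
        cx, cy = rng.choice(centers)
        points.append((cx + rng.gauss(0.0, noise_scale),
                       cy + rng.gauss(0.0, noise_scale)))
    return points


def nearest_center(point, centers):
    """Return the center closest to `point` in Euclidean distance."""
    return min(centers,
               key=lambda c: math.hypot(point[0] - c[0], point[1] - c[1]))


def refine_step(points, centers, step=0.5):
    """Toy stand-in for one denoising level (assumption, not the paper's model):
    move every point a fraction `step` of the way toward its nearest center."""
    refined = []
    for x, y in points:
        cx, cy = nearest_center((x, y), centers)
        refined.append((x + step * (cx - x), y + step * (cy - y)))
    return refined


def mean_error(points, centers):
    """Average distance from each point to its nearest center."""
    total = 0.0
    for p in points:
        c = nearest_center(p, centers)
        total += math.hypot(p[0] - c[0], p[1] - c[1])
    return total / len(points)


def progressive_denoise(points, centers, num_levels=4, step=0.5):
    """Apply the refinement step level by level and record the error per level."""
    errors = [mean_error(points, centers)]
    for _ in range(num_levels):
        points = refine_step(points, centers, step)
        errors.append(mean_error(points, centers))
    return points, errors


rng = random.Random(0)
centers = [(10.0, 12.0), (40.0, 8.0), (25.0, 30.0)]  # hypothetical object centers
noisy = scatter_points(centers, num_points=32, noise_scale=3.0, rng=rng)
refined, errors = progressive_denoise(noisy, centers)
print([round(e, 3) for e in errors])
```

In this sketch the mean error at least halves at every level, which mirrors the coarse-to-fine behavior the abstract attributes to progressive denoising. In the actual method the per-level update would come from learned point features and attention rather than from ground-truth centers, which are available only during training.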
