Context-Enhanced Detector For Building Detection From Remote Sensing Images
Ziyue Huang, Mingming Zhang, Qingjie Liu, Wei Wang, Zhe Dong, Yunhong Wang
TL;DR
This paper tackles accurate building detection in diverse remote sensing scenes by introducing the Context-Enhanced Detector (CEDet), a three-stage cascade that explicitly models contextual information. CEDet pairs a Semantic Guided Contextual Mining (SGCM) module, which fuses multi-scale semantic features using self-attention, with an Instance Context Mining Module (ICMM), which captures instance-level spatial relationships through a relational graph; a segmentation loss based on pseudo-masks guides the contextual feature extraction. The CE Head decouples classification and regression and integrates ICMM within the cascade framework. CEDet achieves state-of-the-art performance on CNBuilding-9P, CNBuilding-23P, and SpaceNet, with notable gains in AP50 and AP75. The results indicate that combining global and instance-level contextual cues improves building detection in complex urban and suburban scenes.
Abstract
Building detection from remote sensing images has made significant progress, but high-accuracy detection remains difficult because of the diversity of building appearances and the complexity of vast scenes. To address these challenges, we propose a novel approach called Context-Enhanced Detector (CEDet). Our approach uses a three-stage cascade structure to enhance the extraction of contextual information and improve building detection accuracy. Specifically, we introduce two modules: the Semantic Guided Contextual Mining (SGCM) module, which aggregates multi-scale contexts and incorporates an attention mechanism to capture long-range interactions, and the Instance Context Mining Module (ICMM), which captures instance-level relationship context by constructing a spatial relationship graph and aggregating instance features. Additionally, we introduce a semantic segmentation loss based on pseudo-masks to guide contextual information extraction. Our method achieves state-of-the-art performance on three building detection benchmarks: CNBuilding-9P, CNBuilding-23P, and SpaceNet.
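The instance-level idea behind ICMM (build a spatial relationship graph over detected instances, then aggregate features along its edges) can be illustrated with a minimal sketch. This is not the authors' implementation: the linking rule (box centers within a fixed `radius`), the mean aggregation, and the function names `build_spatial_graph` and `aggregate_features` are all assumptions chosen for illustration, since the abstract does not specify how the graph is constructed or how features are combined.

```python
import math


def box_center(box):
    """Center (cx, cy) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def build_spatial_graph(boxes, radius):
    """Return adjacency lists for a hypothetical spatial relationship graph.

    Assumption: two instances are linked when their box centers lie within
    `radius` of each other. Each instance is also linked to itself, because
    its distance to itself is zero.
    """
    centers = [box_center(b) for b in boxes]
    neighbors = []
    for cx, cy in centers:
        row = [
            j
            for j, (ox, oy) in enumerate(centers)
            if math.hypot(cx - ox, cy - oy) <= radius
        ]
        neighbors.append(row)
    return neighbors


def aggregate_features(features, neighbors):
    """Replace each instance feature with the mean over its graph neighbors.

    Assumption: plain mean aggregation stands in for whatever learned
    aggregation the actual module uses.
    """
    dim = len(features[0])
    aggregated = []
    for row in neighbors:
        aggregated.append(
            [sum(features[j][d] for j in row) / len(row) for d in range(dim)]
        )
    return aggregated


# Toy example: two nearby buildings and one distant building.
boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
features = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
graph = build_spatial_graph(boxes, radius=20.0)
context_features = aggregate_features(features, graph)
```

In this toy setup the two nearby boxes share context and end up with the same averaged feature, while the isolated box keeps its own. A learned module would typically weight neighbors rather than average them uniformly.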
