SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts
Fan Zhang, Haoyuan Ren, Fei Ma, Qiang Yin, Yongsheng Zhou
TL;DR
This work tackles cross-view object geo-localization between drone and satellite images, a problem plagued by severe viewpoint and scale disparities. It introduces SMGeo, a promptable end-to-end transformer with a shared Swin Transformer backbone and a grid-level sparse Mixture-of-Experts (GMoE) that adaptively specializes features by region, plus an anchor-free localization head for direct center and size regression. Core contributions include the grid-level MoE cross-view encoder, a query-guided fusion mechanism, and an entropy-regularized gating strategy, all trained with a multi-task objective combining center heatmaps, bounding boxes, and MoE regularization. On the challenging CVOGL_DroneAerial dataset, SMGeo achieves state-of-the-art accuracy (e.g., acc@0.25=87.51%, acc@0.5=62.50%, mIoU=61.45%), significantly outperforming prior methods and demonstrating robust performance across object categories and scales, with ablations confirming the effectiveness of shared encoding, fusion, and GMoE components.
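To make the grid-level sparse gating idea concrete, the following is a minimal pure-Python sketch of top-k expert routing per grid cell, together with the entropy of the gate distribution that an entropy regularizer would act on. It is an illustration under stated assumptions, not the authors' implementation: the function names (`top_k_gate`, `gate_entropy`, `moe_grid_layer`), the choice of softmax gating with k=2, and the toy experts are all hypothetical, and the summary above does not specify the exact form or sign of the entropy term.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one grid cell's gate logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k=2):
    # Assumption: keep the k most probable experts and renormalize their weights.
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def gate_entropy(logits):
    # Shannon entropy of the full gate distribution; a regularizer could
    # reward or penalize this quantity (the direction is not stated in the source).
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)

def moe_grid_layer(grid_feats, gate_logits, experts, k=2):
    # Route each grid cell's feature vector through its selected experts only.
    outputs = []
    for feat, logits in zip(grid_feats, gate_logits):
        mixed = [0.0] * len(feat)
        for idx, weight in top_k_gate(logits, k).items():
            expert_out = experts[idx](feat)
            mixed = [m + weight * v for m, v in zip(mixed, expert_out)]
        outputs.append(mixed)
    return outputs

# Hypothetical toy experts standing in for per-expert feed-forward networks.
toy_experts = [lambda v: [2.0 * x for x in v], lambda v: [-x for x in v]]
routed = moe_grid_layer([[1.0, 2.0]], [[5.0, -5.0]], toy_experts, k=1)
```

Because only the selected experts are evaluated for each grid cell, compute grows with k rather than with the total number of experts, which is the usual motivation for sparse routing.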
Abstract
Cross-view object geo-localization aims to locate, in large-scale satellite imagery, the same object that is observed in a drone image. Because of large differences in viewpoint and scale, together with complex background interference, traditional multi-stage "retrieval-matching" pipelines are prone to cumulative errors. To address this, we present SMGeo, a promptable, end-to-end, transformer-based model for object geo-localization. The model supports click prompting and returns the object's location in real time, which makes it suitable for interactive use. The architecture is fully transformer-based: a Swin Transformer jointly encodes the drone and satellite images, and an anchor-free transformer detection head regresses the object coordinates. To better capture both inter-modal and intra-view dependencies, we introduce a grid-level sparse Mixture-of-Experts (GMoE) into the cross-view encoder, allowing it to adaptively activate specialized experts according to the content, scale, and source of each grid. The anchor-free detection head predicts object locations directly in the reference image under heat-map supervision, which avoids the scale bias and matching complexity introduced by predefined anchor boxes. On the drone-to-satellite task, SMGeo achieves leading performance in accuracy at IoU=0.25, accuracy at IoU=0.5, and mIoU (87.51%, 62.50%, and 61.45% on the test set, respectively), outperforming representative methods such as DetGeo (61.97%, 57.66%, and 54.05%, respectively). Ablation studies demonstrate complementary gains from shared encoding, query-guided fusion, and the grid-level sparse mixture-of-experts.
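For readers unfamiliar with the reported metrics, here is a small sketch of how accuracy at an IoU threshold and mIoU are commonly computed from predicted and ground-truth boxes. It is an assumption-laden illustration rather than the paper's evaluation code: the `(x1, y1, x2, y2)` box format, the `>=` threshold comparison, and the function names `box_iou` and `evaluate` are choices made here for clarity.

```python
def box_iou(a, b):
    # Boxes are assumed to be (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.25, 0.5)):
    # One predicted box per query, paired with its ground-truth box.
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    # Assumption: a prediction counts as correct when IoU >= threshold.
    metrics = {f"acc@{t}": sum(iou >= t for iou in ious) / n for t in thresholds}
    metrics["mIoU"] = sum(ious) / n
    return metrics

# Toy example: one perfect prediction and one half-shifted prediction.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
scores = evaluate(preds, gts)
```

In the toy example the second prediction overlaps its target with IoU of 1/3, so it counts as correct at the 0.25 threshold but not at 0.5, which shows why acc@0.5 is the stricter of the two accuracy figures.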
