SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts

Fan Zhang, Haoyuan Ren, Fei Ma, Qiang Yin, Yongsheng Zhou

TL;DR

This work tackles cross-view object geo-localization between drone and satellite images, a task made difficult by severe viewpoint and scale disparities. It introduces SMGeo, a promptable end-to-end transformer with a shared Swin Transformer backbone, a grid-level sparse Mixture-of-Experts (GMoE) that adaptively specializes features by region, and an anchor-free localization head that regresses object center and size directly. Core contributions include the grid-level MoE cross-view encoder, a query-guided fusion mechanism, and an entropy-regularized gating strategy, all trained with a multi-task objective combining center heatmaps, bounding boxes, and MoE regularization. On the CVOGL_DroneAerial dataset, SMGeo achieves state-of-the-art accuracy (acc@0.25=87.51%, acc@0.5=62.50%, mIoU=61.45%), outperforming prior methods such as DetGeo. It performs robustly across object categories and scales, and ablations confirm the contribution of the shared encoding, fusion, and GMoE components.
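To make the reported metrics concrete, here is a minimal sketch of how acc@0.25, acc@0.5, and mIoU could be computed. It assumes that acc@t is the fraction of test samples whose predicted box overlaps the ground-truth box with IoU of at least t, and that boxes are given as (x1, y1, x2, y2); both conventions and the function names `iou` and `evaluate` are assumptions for illustration, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(preds, gts, thresholds=(0.25, 0.5)):
    """Return acc@t for each threshold plus the mean IoU over all samples."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    report = {f"acc@{t}": sum(v >= t for v in ious) / n for t in thresholds}
    report["mIoU"] = sum(ious) / n
    return report


# Toy data: one perfect match, one partial overlap, one miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (0, 0, 10, 4), (20, 20, 30, 30)]
report = evaluate(preds, gts)
```

Because acc@0.5 uses a stricter overlap threshold than acc@0.25, it can never exceed it, which matches the ordering of the reported numbers.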

Abstract

Cross-view object geo-localization aims to precisely pinpoint the same object across large-scale satellite imagery based on drone images. Due to significant differences in viewpoint and scale, coupled with complex background interference, traditional multi-stage "retrieval-matching" pipelines are prone to cumulative errors. To address this, we present SMGeo, a promptable end-to-end transformer-based model for object geo-localization. The model supports click prompting and returns localization results in real time, enabling interactive use. It employs a fully transformer-based architecture, using a Swin-Transformer for joint feature encoding of both drone and satellite imagery and an anchor-free transformer detection head for coordinate regression. To better capture both inter-modal and intra-view dependencies, we introduce a grid-level sparse Mixture-of-Experts (GMoE) into the cross-view encoder, allowing it to adaptively activate specialized experts according to the content, scale, and source of each grid. The anchor-free detection head predicts object locations directly in the reference images via heat-map supervision, which avoids the scale bias and matching complexity introduced by predefined anchor boxes. On the drone-to-satellite task, SMGeo achieves leading accuracy at IoU=0.25 and IoU=0.5 and leading mIoU (87.51%, 62.50%, and 61.45% on the test set, respectively), significantly outperforming representative methods such as DetGeo (61.97%, 57.66%, and 54.05%, respectively). Ablation studies demonstrate complementary gains from shared encoding, query-guided fusion, and the grid-level sparse mixture-of-experts.
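The idea of activating a few experts per grid can be illustrated with a small, dependency-free sketch of top-k sparse routing for a single grid cell. Everything specific here is an assumption for illustration: the number of experts, k=2, the toy scaling experts, the renormalization of the selected gate weights, and the names `route_top_k` and `gmoe_cell`. The paper's actual router and the way the gating entropy enters its loss are not given in this section, so the sketch only computes the entropy without committing to how it is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k=2):
    """Pick the k highest-probability experts for one grid cell.

    Returns (expert indices, renormalized weights, entropy of the full gate).
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    weights = [probs[i] / norm for i in top]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return top, weights, entropy


def gmoe_cell(feature, gate_logits, experts, k=2):
    """Combine the outputs of the selected experts for one grid feature."""
    idx, weights, entropy = route_top_k(gate_logits, k)
    out = [0.0] * len(feature)
    for i, w in zip(idx, weights):
        y = experts[i](feature)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out, idx, entropy


# Hypothetical experts: each just scales the feature by a constant.
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
out, idx, entropy = gmoe_cell([1.0, 2.0], [2.0, 0.0, 1.0, -1.0], experts, k=2)
```

In a view-specific setup, the gate logits for drone and satellite grids would come from separate routers, while the expert pool could be shared; the sketch leaves that choice to the caller.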

Paper Structure

This paper contains 25 sections, 24 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Cross-view image geo-localization. (a) Traditional image-level matching methods. Given a query image, this approach retrieves similar images from a large-scale satellite image database and returns a ranked list. (b) Object-level geo-localization. The task aims to geo-localize a specific object by using images captured from different viewpoints.
  • Figure 2: Structure comparison between previous cross-view object geo-localization methods and the proposed SMGeo. (a) Typical cross-view object geo-localization. Given an input from a certain view, the latent representations are extracted by a view-specific encoder, followed by object detection and merging. (b) Our SMGeo method. Designed for interactive use, the model supports click prompting and delivers real-time localization results. SMGeo introduces a grid-level Mixture-of-Experts (GMoE) based cross-view encoder that jointly learns cross-view representations. The GMoE adaptively activates specialized experts to capture both inter- and intra-view dependencies. In addition, an anchor-free head directly regresses the target’s coordinates in the reference images.
  • Figure 3: Comparison of geo-localization accuracy among different cross-view methods on the CVOGL dataset. The horizontal axis represents acc@0.25, and the vertical axis represents acc@0.5. Our SMGeo method significantly outperforms existing methods on both metrics, achieving the best object geo-localization performance among the compared methods.
  • Figure 4: Overall framework of the proposed SMGeo. A cross-view encoder based on GMoE utilizes a view-specific router for top-k expert selection to adaptively process cross-view image features. Subsequently, a cross-view feature fusion module fuses the features from the two views with the encoded click prompts. Finally, the anchor-free detection head directly regresses target center heatmaps and bounding box offsets.
  • Figure 5: Tracking the size changes of the query image and reference images.
  • ...and 9 more figures
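
The anchor-free head described in the Figure 4 caption regresses a center heatmap and bounding box offsets. A minimal sketch of how such outputs could be decoded into a single box is given below. It assumes a CenterNet-style layout (one score, one sub-cell offset, and one width/height pair per grid cell, with a fixed stride between the grid and the reference image); the layout, the stride value, and the name `decode_box` are assumptions for illustration rather than the paper's actual head.

```python
def decode_box(heatmap, offsets, sizes, stride=8):
    """Decode one box from per-cell predictions.

    heatmap[r][c] is the center score, offsets[r][c] is (dx, dy) in grid
    units, and sizes[r][c] is (w, h) in reference-image pixels.
    Returns ((x1, y1, x2, y2), score).
    """
    cells = ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0])))
    r, c = max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])  # peak of the heatmap
    dx, dy = offsets[r][c]
    w, h = sizes[r][c]
    cx, cy = (c + dx) * stride, (r + dy) * stride  # center in image pixels
    box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return box, heatmap[r][c]


# Toy 3x3 prediction with the peak at row 1, column 2.
heatmap = [[0.1, 0.2, 0.1],
           [0.0, 0.3, 0.9],
           [0.1, 0.1, 0.2]]
offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
sizes = [[(1.0, 1.0)] * 3 for _ in range(3)]
offsets[1][2] = (0.5, 0.25)
sizes[1][2] = (4.0, 2.0)
box, score = decode_box(heatmap, offsets, sizes, stride=8)
```

Because the box comes straight from the heatmap peak and the regressed values at that cell, no predefined anchor boxes or anchor matching are involved.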