MOGeo: Beyond One-to-One Cross-View Object Geo-localization

Bo Lv, Qingwang Zhang, Le Wu, Yuanyuan Li, Yingying Zhu

Abstract

Cross-View Object Geo-Localization (CVOGL) aims to locate, within a corresponding satellite image, an object of interest specified in a query image. Existing methods typically assume that the query image contains only a single object, an assumption that does not match the multi-object geo-localization requirements of real-world applications and limits their practical use. To bridge the gap between this realistic setting and the existing task, we propose a new task, Cross-View Multi-Object Geo-Localization (CVMOGL). To advance CVMOGL, we first construct a benchmark, CMLocation, which comprises two datasets: CMLocation-V1 and CMLocation-V2. We then propose a novel cross-view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods. Extensive experiments under various application scenarios validate the effectiveness of our method. The results show that cross-view object geo-localization in this more realistic setting remains a challenging problem, encouraging further research in this area.

Paper Structure (17 sections, 9 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Comparison of cross-view object geo-localization in single-object and multi-object scenarios. Click points represent query objects, while bounding boxes in geo-tagged satellite images indicate location information. Points and bounding boxes of the same color form an object pair, such as $p_1$ and $b_1$, where $b_1$'s geographic location is considered the position of $p_1$.
  • Figure 2: Attention map comparison: previous vs. ours. Previous smooth positional encodings lead to diffuse attention maps, whereas our proposed attention map successfully concentrates on the target location to provide highly discriminative features.
  • Figure 3: Examples from the CMLocation-V1 and CMLocation-V2 datasets. CMLocation-V1 is curated under strict alignment principles (specifically, center alignment and northward orientation), which ensure a consistent spatial distribution across all instances. By contrast, CMLocation-V2 does not impose these constraints, resulting in more heterogeneous spatial configurations.
  • Figure 4: The size distributions of objects in the CMLocation-V1 and CMLocation-V2 datasets.
  • Figure 5: Overview of our proposed MOGeo model. Our method takes two inputs: a query image (either a ground-view or a drone-view image) that contains an arbitrary number of objects of interest, and a reference satellite image that contains the objects to be localized. The input images are processed by our position encoding module, MOPE, to extract features for each query object. The query object features are then fused with the reference image to generate the final predictions. In the figure, colored bounding boxes indicate the object pairs.
  • ...and 3 more figures