Anchor-free Cross-view Object Geo-localization with Gaussian Position Encoding and Cross-view Association

Xingtao Ling, Chenlin Fu, Yingying Zhu

TL;DR

The paper tackles cross-view object geo-localization under large viewpoint gaps and positional uncertainty by introducing AFGeo, an anchor-free framework that forgoes predefined anchors in favor of direct pixel-wise localization. It couples Gaussian Position Encoding (GPE), which models the query click point as a learnable 2D Gaussian, with a Cross-view Object Association Module (CVOAM) that aligns semantically consistent context across views, all within a lightweight architecture. The method features an FCOS-inspired anchor-free localization head that decouples classification and regression and employs FCOS-style targets and a multi-term loss including focal, BCE, and GIoU components. AFGeo achieves state-of-the-art results on the CVOGL and G2D benchmarks, demonstrating strong localization accuracy with minimal parameter overhead and enabling deployment in resource-constrained scenarios.

Abstract

Most existing cross-view object geo-localization approaches adopt an anchor-based paradigm. Although effective, such methods are inherently constrained by predefined anchors. To eliminate this dependency, we first propose an anchor-free formulation for cross-view object geo-localization, termed AFGeo. AFGeo directly predicts the four directional offsets (left, right, top, bottom) to the ground-truth box for each pixel, thereby localizing the object without any predefined anchors. To obtain a more robust spatial prior, AFGeo incorporates Gaussian Position Encoding (GPE) to model the click point in the query image, mitigating the uncertainty of object position that challenges object localization in cross-view scenarios. In addition, AFGeo incorporates a Cross-view Object Association Module (CVOAM) that relates the same object and its surrounding context across viewpoints, enabling reliable localization under large cross-view appearance gaps. By adopting an anchor-free localization paradigm that integrates GPE and CVOAM with minimal parameter overhead, our model is both lightweight and computationally efficient, achieving state-of-the-art performance on benchmark datasets.
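
The per-pixel offset formulation described in the abstract can be illustrated with a minimal sketch. It assumes FCOS-style conventions: a pixel's regression target is its distance to each side of the ground-truth box, and only pixels strictly inside the box count as positive samples. The function names and the positive-sample rule are illustrative assumptions, not the authors' implementation; details such as feature-map stride or center sampling are not given in this summary and are left out.

```python
def ltrb_targets(px, py, box):
    """Distances from pixel (px, py) to the sides of box = (x0, y0, x1, y1).

    Returned in the order used in the abstract: left, right, top, bottom.
    """
    x0, y0, x1, y1 = box
    return (px - x0, x1 - px, py - y0, y1 - py)


def is_positive(offsets):
    """Assumed rule: a pixel is a positive sample only if it lies strictly inside the box."""
    return min(offsets) > 0


def decode_box(px, py, offsets):
    """Turn predicted offsets at pixel (px, py) back into a box (x0, y0, x1, y1)."""
    left, right, top, bottom = offsets
    return (px - left, py - top, px + right, py + bottom)


gt_box = (2.0, 1.0, 6.0, 5.0)
offsets = ltrb_targets(3.0, 3.0, gt_box)
print(offsets)                        # (1.0, 3.0, 2.0, 2.0)
print(is_positive(offsets))           # True
print(decode_box(3.0, 3.0, offsets))  # (2.0, 1.0, 6.0, 5.0)
```

Because each pixel encodes the box relative to its own position, no anchor shapes or aspect ratios need to be chosen in advance; at inference, the box decoded at the highest-scoring pixel would serve as the prediction.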

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of the cross-view object geo-localization task and comparison of different solutions. (a) The reference image is usually a satellite image with geographic information (a GPS coordinate). The query image is from a ground or drone view, with the click point (red dots) of the object of interest. Cross-view object geo-localization aims to find the geographic location of an object of interest in a query image from a reference image with attached geographic information. (b) The retrieval-based scheme divides the satellite image into uniformly-sized patches to construct a reference database and attempts to retrieve the patch that is most semantically similar to the object of interest as a location proposal. The anchor-based scheme generates a series of candidate bounding boxes from predefined anchors, and then selects the best bounding box containing the object of interest from these candidates. The anchor-free (ours) scheme directly predicts the object location at each pixel and selects the best bounding box containing the object of interest. As a result, our scheme is no longer constrained by predefined anchors.
  • Figure 2: Overview of our AFGeo. AFGeo adopts an anchor-free architecture for cross-view object geo-localization. The structure consists of two key components: Gaussian Position Encoding and the Cross-View Object Association Module, with the localization head following the anchor-free paradigm.
  • Figure 3: Illustration of the proposed GPE. (a) We employ a learnable standard deviation to control the Gaussian distribution, enabling the model to adaptively learn the relationship between the relative distance of the object and the correlation of object features, meaning that the farther the relative distance, the weaker the feature correlation. (b) We visualize the position encodings under different Gaussian distributions. The learnable GPE allows the model to better adapt to the uncertainty of object locations, focusing more on the object feature regions (gray–white areas) while ignoring the interference from irrelevant background (black areas).
  • Figure 4: Heatmap visualization of Baseline (our framework without GPE and CVOAM) and the proposed AFGeo. The red dot denotes the click point and the red box denotes the ground-truth bounding box.
  • Figure 5: Object localization results by our AFGeo. The red and green bounding boxes denote the ground-truth and predicted results, respectively.
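
The Gaussian Position Encoding described in the Figure 3 caption can be sketched as follows. This summary does not give the paper's equation, so the sketch assumes the standard unnormalized isotropic Gaussian exp(-d^2 / (2 * sigma^2)) centered on the click point, where d is the pixel's distance to the click. In AFGeo the standard deviation is learnable; here `sigma` is a plain float passed by the caller, and the function name is hypothetical.

```python
import math


def gaussian_position_encoding(height, width, click_xy, sigma):
    """Return a height x width map that peaks at the click point.

    Assumed form: exp(-d^2 / (2 * sigma^2)), d = distance to the click.
    sigma stands in for the learnable standard deviation in the paper.
    """
    cx, cy = click_xy
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / two_sigma_sq)
            for x in range(width)
        ]
        for y in range(height)
    ]


narrow = gaussian_position_encoding(5, 5, click_xy=(2, 2), sigma=1.0)
wide = gaussian_position_encoding(5, 5, click_xy=(2, 2), sigma=3.0)

print(narrow[2][2])               # 1.0 at the click point
print(narrow[0][0] < wide[0][0])  # True: a larger sigma decays more slowly
```

A small sigma concentrates the encoding tightly around the click, while a larger sigma spreads weight over more of the surrounding context, which matches the caption's description of weaker feature correlation at greater relative distance.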