Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features
Xingtao Ling, Yingying Zhu
TL;DR
The paper tackles cross-view object geo-localization, where an object must be localized across substantially different viewpoints (ground/drone vs. satellite). It introduces AttenGeo, which combines a Cross-view and Cross-attention Module (CVCAM) for iterative cross-view interaction with a Multi-head Spatial Attention Module (MHSAM) for multi-scale spatial refinement; together they improve cross-view correspondence and reduce edge-noise interference. Key contributions are the CVCAM and MHSAM modules, the G2D ground-to-drone dataset, and state-of-the-art localization accuracy on the CVOGL and G2D datasets. The work advances practical cross-view localization by enabling richer contextual learning across views and more robust feature refinement, with implications for robotics and autonomous navigation.
Abstract
Cross-view object geo-localization has recently gained attention due to its potential applications. Existing methods use attention mechanisms to capture the spatial dependencies of query objects between different views, producing spatial relationship feature maps that are then used to predict object locations. Although promising, these approaches fail to transfer information effectively between views and do not further refine the spatial relationship feature maps. As a result, the model erroneously focuses on irrelevant edge noise, which degrades localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset, G2D, for the "Ground-to-Drone" localization task, which enriches existing datasets and fills a gap for this task. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.
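The two ideas described in the abstract, repeated cross-view interaction and multi-scale spatial refinement, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attend`, `iterative_cross_view`, `multi_scale_spatial_attention`), the absence of learned projection weights, the channel-mean pooling, the fixed averaging kernels, and the kernel sizes (3, 5, 7) are all assumptions chosen only to keep the example short and runnable.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(target, source):
    # Update `target` tokens (n_t, d) with context from `source` tokens (n_s, d).
    # Assumption: no learned query/key/value projections, for brevity.
    d = target.shape[-1]
    weights = softmax(target @ source.T / np.sqrt(d))  # (n_t, n_s)
    return target + weights @ source  # residual update


def iterative_cross_view(query_feats, ref_feats, num_iters=3):
    # Several rounds of interaction: each view attends to the other,
    # so context about the query object flows in both directions.
    q, r = query_feats, ref_feats
    for _ in range(num_iters):
        q, r = cross_attend(q, r), cross_attend(r, q)
    return q, r


def conv2d_same(image, kernel):
    # Naive 2D convolution with zero padding; output has the input's shape.
    k = kernel.shape[0]
    padded = np.pad(image, k // 2)
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = (padded[i:i + k, j:j + k] * kernel).sum()
    return out


def multi_scale_spatial_attention(feature_map, kernel_sizes=(3, 5, 7)):
    # feature_map: (C, H, W). One spatial mask per kernel size, then averaged.
    # Assumption: averaging kernels stand in for learned convolution weights.
    pooled = feature_map.mean(axis=0)  # (H, W)
    masks = []
    for k in kernel_sizes:
        kernel = np.full((k, k), 1.0 / (k * k))
        masks.append(1.0 / (1.0 + np.exp(-conv2d_same(pooled, kernel))))
    mask = np.mean(masks, axis=0)  # (H, W), values in (0, 1)
    return feature_map * mask  # broadcast over channels
```

In this sketch, `iterative_cross_view` takes flattened token features from the two views, and `multi_scale_spatial_attention` reweights a feature map so that locations supported at several receptive-field sizes are kept while others are damped. A trained model would replace the fixed kernels and the missing projections with learned parameters.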
