SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

Xiaoran Zhang, Yu Liu, Jinyu Liang, Kangqiushi Li, Zhiwei Huang, Huaxin Xiao

Abstract

Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization, multi-dimensional reliability evaluation, and geographic consensus. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and a corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state of the art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.
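The two headline metrics in the abstract, mean localization error and accuracy within a 5-m threshold, can be illustrated with a minimal sketch. The function names, the toy coordinates, and the choice to count an error of exactly 5 m as a hit are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math

def horizontal_errors(pred, truth):
    """Euclidean distance in metres between predicted and true (east, north) pairs."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]

def summarize(errors, threshold_m=5.0):
    """Return (mean error in metres, fraction of queries with error <= threshold_m)."""
    mean_error = sum(errors) / len(errors)
    # Assumption: the threshold is inclusive.
    accuracy = sum(e <= threshold_m for e in errors) / len(errors)
    return mean_error, accuracy

# Toy data (hypothetical): four queries in a local metric frame.
pred = [(3.0, 4.0), (0.0, 0.0), (10.0, 0.0), (0.0, 1.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]

errors = horizontal_errors(pred, truth)
mean_error, acc_5m = summarize(errors)
print(f"mean error: {mean_error:.2f} m, accuracy@5m: {acc_5m:.2%}")
```

Under this reading, an "N-fold accuracy improvement" would be the ratio of one method's accuracy at the threshold to the baseline's accuracy at the same threshold.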

Paper Structure

This paper contains 27 sections, 18 equations, 4 figures, 5 tables, 3 algorithms.

Figures (4)

  • Figure 1: Conceptual comparison of the existing paradigm and our SCC-Loc framework. To overcome severe cross-modal bottlenecks in thermal geo-localization, SCC-Loc systematically resolves: (a) spatial quantization bias via semantic alignment (SGVA); (b) dense structural outliers via cascaded filtering (C-SATSF); and (c) deceptive visual decoys via consensus selection (CD-RAPS).
  • Figure 2: Overview of the proposed SCC-Loc framework. The pipeline consists of four main stages: (1) Feature Extraction: A shared DINOv2 backbone bridges the thermal-visible modality gap to extract dense spatial features and global tokens; (2) SGVA Module: The Semantic-Guided Viewport Alignment module leverages the query's global token to adaptively crop and align the retrieved satellite candidates with the UAV field-of-view; (3) C-SATSF Mechanism: Following dense matching via MINIMA$_{\text{RoMa}}$, the Cascaded Spatial-Adaptive Texture-Structure Filtering progressively purifies raw correspondences by eliminating spatial, textural, and structural inconsistencies; (4) CD-RAPS Strategy: The Consensus-Driven Reliability-Aware Position Selection integrates 3D pose optimization and multi-dimensional reliability evaluation to vote for the robust optimal horizontal position via geographic consensus.
  • Figure 3: Visual samples from the constructed Thermal-UAV dataset. The dataset systematically captures profound modality discrepancies and diurnal thermal variations across diverse spatial topologies: (a) Urban and (b) Rural Scenes showcase thermal UAV queries of distinct semantic categories (e.g., buildings, roads, fields) during daytime and nighttime; (c) Reference Map presents the global database at varying search scales, comprising visible-light satellite ortho-photos and their spatially aligned Digital Surface Models (DSM) to supply crucial 3D elevation priors.
  • Figure 4: Qualitative visualization of the proposed SCC-Loc pipeline in (a) Urban and (b) Rural scenarios. The process illustrates the adaptive correction of spatial quantization bias via the SGVA module (Re-Cropping) and the progressive elimination of structural outliers using the C-SATSF mechanism (Matching). Finally, based on the highly purified correspondences, the CD-RAPS strategy refines candidate poses via physically constrained optimization, and computes the total reliability (Final score) by fusing the multi-dimensional evaluation (Base score) with geographic consensus to determine the robust optimal hypothesis for precise Localization.
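The Figure 4 caption describes fusing a per-candidate base score with geographic consensus to pick the final hypothesis. The sketch below shows one simple way such a fusion could work; it is not the paper's CD-RAPS formulation. The multiplicative fusion rule, the 10 m support radius, and the name `select_by_consensus` are all hypothetical choices made for illustration.

```python
import math

def select_by_consensus(candidates, radius_m=10.0):
    """Pick the candidate whose base score is best supported by nearby candidates.

    candidates: list of (east_m, north_m, base_score) tuples.
    Returns (index of best candidate, list of final scores).
    """
    final_scores = []
    for i, (xi, yi, si) in enumerate(candidates):
        # Hypothetical consensus term: summed base scores of the other
        # candidates that land within radius_m of this one.
        support = sum(
            sj
            for j, (xj, yj, sj) in enumerate(candidates)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius_m
        )
        # Hypothetical fusion rule: base score boosted by neighbour support.
        final_scores.append(si * (1.0 + support))
    best = max(range(len(candidates)), key=final_scores.__getitem__)
    return best, final_scores

# Toy data (hypothetical): two mutually consistent candidates and one
# isolated decoy that has the highest base score on its own.
candidates = [
    (100.0, 200.0, 0.625),
    (103.0, 204.0, 0.5),
    (500.0, 900.0, 0.75),
]
best, scores = select_by_consensus(candidates)
print(best, scores)
```

In this toy case the isolated candidate has the highest base score but receives no neighbour support, so one of the two agreeing candidates is selected. This mirrors the caption's point that consensus helps reject deceptive visual decoys.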