PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization

Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan

Abstract

We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation and use active sensors such as laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware cost and complexity. PiLoT breaks this paradigm by directly registering the live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps), which enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running at over 25 FPS on the NVIDIA Jetson Orin platform. Our code and dataset are available at: https://github.com/Choyaa/PiLoT.
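
The decoupling of rendering from localization described in the first contribution can be illustrated with a minimal Python sketch. This is not the authors' implementation: `LatestSlot`, `render_view`, `register`, and `run` are hypothetical names, the pose is reduced to a single float, and the rendering and registration steps are placeholders. The sketch only shows the threading pattern, in which the localization loop always reads the most recent rendered reference and never waits for the renderer.

```python
import threading
import time


class LatestSlot:
    """Thread-safe holder that keeps only the most recent value (hypothetical helper)."""

    def __init__(self, value=None):
        self._lock = threading.Lock()
        self._value = value

    def put(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def render_view(pose):
    # Placeholder for rendering the geo-referenced 3D map at `pose` (assumption).
    time.sleep(0.002)  # simulate rendering cost
    return {"rendered_at": pose}


def register(frame, reference):
    # Placeholder for registration: a 1-D "pose" pulled halfway toward the frame value.
    return 0.5 * (reference["rendered_at"] + frame)


def run(frames, initial_pose=0.0):
    pose_slot = LatestSlot(initial_pose)
    ref_slot = LatestSlot(render_view(initial_pose))
    stop = threading.Event()

    def render_loop():
        # Render thread: keep re-rendering at the latest pose estimate.
        while not stop.is_set():
            ref_slot.put(render_view(pose_slot.get()))

    renderer = threading.Thread(target=render_loop, daemon=True)
    renderer.start()
    poses = []
    try:
        # Localization thread (here, the main thread): never blocks on rendering.
        for frame in frames:
            pose = register(frame, ref_slot.get())
            pose_slot.put(pose)
            poses.append(pose)
    finally:
        stop.set()
        renderer.join()
    return poses


if __name__ == "__main__":
    print(run([1.0, 2.0, 3.0]))
```

Keeping only the latest reference, rather than queueing every rendered view, is one simple way to avoid the localization loop falling behind when rendering is slower than the frame rate.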

Paper Structure (17 sections, 13 equations, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of PiLoT. Our system takes a live video frame and a geo-referenced 3D map as input, and outputs 1) the UAV 6-DoF pose, visualized by the tight alignment in the AR overlays (bottom row), and 2) the 3D geo-coordinates of any target pixel, as shown in the dynamic target tracking example (left, filmstrip view). PiLoT achieves drift-free, real-time, and long-term ego and target geo-localization, demonstrated on a 10 km UAV trajectory with error color-coded (green: low, red: high). The system attains a median error of 1.37 m, a per-frame latency of 30 to 40 ms, and a 100% success rate across day-to-night and cross-season variations without GNSS or IMU signals.
  • Figure 2: PiLoT's Dual-Thread Framework. We decouple rendering from localization into two parallel threads. A Render Thread generates synthetic views, while a concurrent Localization Thread registers the live frame against them to compute the pose, ensuring high-frequency accuracy.
  • Figure 3: Overview of the PiLoT framework and localization pipeline. (a) The overall pipeline inputs a query frame and outputs the UAV's 6-DoF ego-pose along with the target's 3-DoF geo-location. (b) A highly efficient one-to-many paradigm matches multiple query hypotheses against a single rendered reference view via feature alignment. (c) Our coarse-to-fine optimizer iteratively narrows the search space to converge on the optimal 6-DoF pose. (d) The final estimated trajectory demonstrates robust and drift-free sequential localization.
  • Figure 4: Overview of our synthetic data generation and its resulting zero-shot sim-to-real performance. From left to right: (a) realistic UAV trajectories rendered over geo-referenced 3D tiles in Cesium for Unreal; (b) multi-condition diversity across weather/time and viewpoint (in-plane yaw, out-of-plane pitch/yaw, planar translation $T_x,T_y$, altitude $T_z$); (c) geometric consistency: we export absolute per-pixel depth and validate by reprojection; (d) our three-level feature pyramid on query (real) vs. reference (synthetic) images.
  • Figure 5: Rotation-aware sampling and coarse-to-fine optimization. The figure visualizes the pose convergence process in the pitch/yaw space: it synergizes wide-area Rotation-Aware Sampling (a) with parallel, coarse-to-fine refinement (b-d) to ensure robust convergence under aggressive motion.
  • ...and 5 more figures
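
The captions above describe recovering the 3D location of a target pixel once the camera pose is known. One common way to do this, sketched below, is to back-project the pixel with a depth value and transform it into the map frame: $X_w = R_{wc}\,(d\,K^{-1}[u, v, 1]^\top) + t_{wc}$. This is an illustration under stated assumptions, not the paper's method: the function `pixel_to_world`, the pinhole intrinsics `fx, fy, cx, cy`, the z-depth convention, and the use of a local metric world frame (rather than geodetic coordinates) are all assumptions introduced here.

```python
def pixel_to_world(u, v, depth, fx, fy, cx, cy, R_wc, t_wc):
    """Back-project pixel (u, v) with z-depth `depth` into the world frame.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy) and a
    camera-to-world pose given by rotation R_wc (3x3 nested lists) and
    translation t_wc (length-3 sequence). All names are hypothetical.
    """
    # Point in the camera frame: depth * K^{-1} [u, v, 1]^T.
    x_c = (u - cx) / fx * depth
    y_c = (v - cy) / fy * depth
    p_c = (x_c, y_c, depth)
    # Rigid transform into the world frame: X_w = R_wc @ p_c + t_wc.
    return tuple(
        sum(R_wc[i][j] * p_c[j] for j in range(3)) + t_wc[i] for i in range(3)
    )


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    point = pixel_to_world(
        u=150.0, v=50.0, depth=5.0,
        fx=100.0, fy=100.0, cx=50.0, cy=50.0,
        R_wc=identity, t_wc=(10.0, 20.0, 30.0),
    )
    print(point)
```

With an identity rotation, the example pixel lies one focal length to the right of the principal point, so at a depth of 5 it maps to an offset of 5 along both the camera x and z axes before the translation is added.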