VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments

Yifei Chen; Xupeng Chen; Feng Wang; Niangang Jiao; Jiayin Liu

TL;DR

VANGUARD is a lightweight, deterministic Geometric Perception Skill that any LLM-based agent can invoke as a callable tool to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors. It detects small vehicles via oriented bounding boxes, robustly estimates their modal pixel length through kernel density estimation, and converts that length to GSD using a pre-calibrated reference length.

Abstract

Autonomous aerial robots operating in GPS-denied or communication-degraded environments frequently lose access to camera metadata and telemetry, leaving onboard perception systems unable to recover the absolute metric scale of the scene. As LLM/VLM-based planners are increasingly adopted as high-level agents for embodied systems, their ability to reason about physical dimensions becomes safety-critical, yet our experiments show that five state-of-the-art VLMs suffer from spatial scale hallucinations, with median area estimation errors exceeding 50%. We propose VANGUARD, a lightweight, deterministic Geometric Perception Skill designed as a callable tool that any LLM-based agent can invoke to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors: small vehicles detected via oriented bounding boxes, whose modal pixel length is robustly estimated through kernel density estimation and converted to GSD using a pre-calibrated reference length. The tool returns both a GSD estimate and a composite confidence score, enabling the calling agent to autonomously decide whether to trust the measurement or fall back to alternative strategies. On the DOTA v1.5 benchmark, VANGUARD achieves 6.87% median GSD error on 306 images. Integrated with SAM-based segmentation for downstream area measurement, the pipeline yields 19.7% median error on a 100-entry benchmark, with 2.6x lower category dependence and 4x fewer catastrophic failures than the best VLM baseline. This demonstrates that equipping agents with deterministic geometric tools is essential for safe autonomous spatial reasoning.
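The core conversion described in the abstract (modal vehicle pixel length to GSD) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kde_mode` and `estimate_gsd`, the placeholder reference length `REF_LENGTH_M = 5.0`, the normal-reference bandwidth rule, and the grid search for the density peak are all assumptions made for illustration. The sketch also omits the detection, outlier-filtering, and confidence-scoring stages.

```python
import math
import statistics

# Hypothetical placeholder; the paper uses a pre-calibrated value not given here.
REF_LENGTH_M = 5.0


def kde_mode(samples, grid_points=512):
    """Return the peak location of a Gaussian KDE fitted to `samples`."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if n == 1:
        return float(samples[0])
    sigma = statistics.stdev(samples)
    if sigma == 0:
        return float(samples[0])
    # Normal-reference rule-of-thumb bandwidth (an assumption, not from the paper).
    h = 1.06 * sigma * n ** (-1 / 5)
    lo, hi = min(samples), max(samples)
    best_x, best_density = lo, -1.0
    # Evaluate the unnormalised density on a regular grid and keep the maximum.
    for i in range(grid_points):
        x = lo + (hi - lo) * i / (grid_points - 1)
        density = sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def estimate_gsd(pixel_lengths, ref_length_m=REF_LENGTH_M):
    """Convert the modal vehicle length in pixels to GSD in metres per pixel."""
    return ref_length_m / kde_mode(pixel_lengths)


# Six car-sized detections plus one long outlier (e.g. a truck), in pixels.
lengths_px = [20.0, 20.5, 19.5, 20.2, 19.8, 20.1, 45.0]
mode_px = kde_mode(lengths_px)
gsd = estimate_gsd(lengths_px)
```

Taking the mode rather than the mean keeps the single long detection from pulling the estimate upward, which is the motivation for a density-based estimator in this setting.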

Paper Structure (23 sections, 8 equations, 4 figures, 3 tables)

INTRODUCTION
RELATED WORK
GSD and scale estimation.
VLM spatial reasoning deficiencies.
METHOD
Overview
Reference Length Determination
Vehicle Detection
Outlier Filtering
KDE Mode Estimation
Confidence Evaluation and Safety Fallback
Autonomous safety fallback.
EXPERIMENTS
Dataset and Setup
Vehicle Detection and GSD Evaluation
...and 8 more sections

Figures (4)

Figure 1: Our Geometric Perception Skill in action on a DOTA v1.5 image (P2593, Ground Track Field). (a) A UAV captures an image without metadata; the target object (red box) must be measured from pixels alone. (b) Our skill detects small vehicles via oriented bounding boxes (green) and derives $\mathrm{GSD} = 0.241$ m/px from their modal pixel length. (c) Combined with SAM segmentation, the pipeline predicts the field area within 1.8% of ground truth, whereas GPT-4o hallucinates a 50% underestimate.
Figure 2: System architecture as an Embodied Agent Perception Loop. Left: a UAV operating in a GPS-denied environment captures monocular imagery. Centre (dashed box): our Deterministic Geometric Skill processes the image through YOLO-OBB detection, outlier filtering, KDE mode estimation, and a resolution guard to produce a calibrated GSD with confidence score. Right: the LLM/VLM Planner can either rely on direct visual estimation (red path), suffering from Spatial Scale Hallucination ($> 50\%$ error), or invoke our Geometric Skill (green path) for safe metric planning.
Figure 3: End-to-end GSD estimation results (YOLO11l-OBB, 306 images). (a) Predicted vs. ground-truth GSD for images with GSD $< 0.55$ m/px (271 of 306 images). The shaded band denotes $\pm 10\%$ error; the horizontal dotted line at $\mathrm{GSD}_{\mathrm{pred}} \approx 0.31$ m/px marks the resolution guard limit ($P_{\mathrm{mode}} \approx 16$ px), beyond which predictions saturate. (b) Cumulative error distribution: 66.0% of images achieve $< 10\%$ error (green dashed) and 83.3% achieve $< 20\%$ error (amber dashed). The long tail corresponds primarily to low-resolution images (GSD $> 0.3$ m/px).
Figure 4: Qualitative comparison on three benchmark entries. Top: annotated input with GPT-4o predictions (red). Bottom: RS-GSD pipeline with vehicle OBBs (green) and SAM segmentation overlay. Our geometric approach achieves 0.7–4.3% error vs. GPT-4o's 50–58% error, illustrating the spatial scale hallucination phenomenon.
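The area measurements referred to in the captions above depend on GSD quadratically, since each pixel covers GSD x GSD square metres. The following sketch shows that arithmetic under stated assumptions: the helper names `mask_area_m2` and `area_relative_error` are hypothetical, and the 10,000-pixel mask is an invented example rather than a value from the paper. Only the GSD of 0.241 m/px is taken from Figure 1.

```python
def mask_area_m2(pixel_count, gsd_m_per_px):
    """Ground area covered by a segmentation mask of `pixel_count` pixels."""
    return pixel_count * gsd_m_per_px ** 2


def area_relative_error(gsd_relative_error):
    """Relative area error caused by a given relative GSD error.

    Area scales with GSD squared, so a GSD error e becomes (1 + e)^2 - 1,
    roughly 2e for small e.
    """
    return (1.0 + gsd_relative_error) ** 2 - 1.0


# Hypothetical mask of 10,000 pixels at the GSD reported in Figure 1.
area = mask_area_m2(10_000, 0.241)

# Illustrative only: how a 5% GSD overestimate propagates to area.
propagated = area_relative_error(0.05)
```

Because of this squaring, a modest GSD error roughly doubles when carried into an area estimate, before any segmentation error is added.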
