VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments

Yifei Chen, Xupeng Chen, Feng Wang, Niangang Jiao, Jiayin Liu

TL;DR

VANGUARD is a lightweight, deterministic Geometric Perception Skill, designed as a callable tool that any LLM-based agent can invoke to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors. Small vehicles are detected via oriented bounding boxes, their modal pixel length is robustly estimated through kernel density estimation, and that length is converted to GSD using a pre-calibrated reference length.

Abstract

Autonomous aerial robots operating in GPS-denied or communication-degraded environments frequently lose access to camera metadata and telemetry, leaving onboard perception systems unable to recover the absolute metric scale of the scene. As LLM/VLM-based planners are increasingly adopted as high-level agents for embodied systems, their ability to reason about physical dimensions becomes safety-critical -- yet our experiments show that five state-of-the-art VLMs suffer from spatial scale hallucinations, with median area estimation errors exceeding 50%. We propose VANGUARD, a lightweight, deterministic Geometric Perception Skill designed as a callable tool that any LLM-based agent can invoke to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors: small vehicles detected via oriented bounding boxes, whose modal pixel length is robustly estimated through kernel density estimation and converted to GSD using a pre-calibrated reference length. The tool returns both a GSD estimate and a composite confidence score, enabling the calling agent to autonomously decide whether to trust the measurement or fall back to alternative strategies. On the DOTA v1.5 benchmark, VANGUARD achieves 6.87% median GSD error on 306 images. Integrated with SAM-based segmentation for downstream area measurement, the pipeline yields 19.7% median error on a 100-entry benchmark -- with 2.6x lower category dependence and 4x fewer catastrophic failures than the best VLM baseline -- demonstrating that equipping agents with deterministic geometric tools is essential for safe autonomous spatial reasoning.
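
The core estimation step in the abstract (modal vehicle length via kernel density estimation, then division into a reference length) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the normal-reference bandwidth rule, the grid search, and the 5.0 m reference length are all hypothetical choices made for the example. The detector is assumed to have already produced one long-side length in pixels per vehicle, and the composite confidence score is omitted because its definition is not given here.

```python
import math
import statistics


def kde_mode(lengths_px, grid_steps=512):
    """Return the pixel length at which a Gaussian KDE of the samples peaks."""
    n = len(lengths_px)
    if n < 2:
        raise ValueError("need at least two vehicle lengths")
    sigma = statistics.stdev(lengths_px)
    if sigma == 0:
        return float(lengths_px[0])
    # Normal-reference rule-of-thumb bandwidth (an assumption for this sketch).
    h = 1.06 * sigma * n ** (-1 / 5)
    lo, hi = min(lengths_px), max(lengths_px)
    best_x, best_density = lo, -1.0
    # Evaluate the (unnormalised) density on a regular grid and keep the peak.
    for i in range(grid_steps + 1):
        x = lo + (hi - lo) * i / grid_steps
        density = sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in lengths_px)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def estimate_gsd(lengths_px, reference_length_m=5.0):
    """GSD in m/px = reference vehicle length (m) / modal pixel length (px).

    reference_length_m is a placeholder value, not the paper's calibration.
    """
    return reference_length_m / kde_mode(lengths_px)


# Hypothetical detections: seven cars near 20 px plus two outliers
# (for example a truck and a partially occluded vehicle).
lengths = [20.0, 20.5, 19.5, 20.2, 19.8, 21.0, 20.1, 45.0, 8.0]
mode_px = kde_mode(lengths)
gsd = estimate_gsd(lengths)
```

Using the mode rather than the mean is what keeps the two outliers in the example from dragging the estimate far from the dominant cluster of car-sized detections.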

Paper Structure

This paper contains 23 sections, 8 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Our Geometric Perception Skill in action on a DOTA v1.5 image (P2593, Ground Track Field). (a) A UAV captures an image without metadata; the target object (red box) must be measured from pixels alone. (b) Our skill detects small vehicles via oriented bounding boxes (green) and derives $\mathrm{GSD} = 0.241$ m/px from their modal pixel length. (c) Combined with SAM segmentation, the pipeline predicts the field area within 1.8% of ground truth, whereas GPT-4o hallucinates a 50% underestimate.
  • Figure 2: System architecture as an Embodied Agent Perception Loop. Left: a UAV operating in a GPS-denied environment captures monocular imagery. Centre (dashed box): our Deterministic Geometric Skill processes the image through YOLO-OBB detection, outlier filtering, KDE mode estimation, and a resolution guard to produce a calibrated GSD with confidence score. Right: the LLM/VLM Planner can either rely on direct visual estimation (red path), suffering from Spatial Scale Hallucination ($> 50\%$ error), or invoke our Geometric Skill (green path) for safe metric planning.
  • Figure 3: End-to-end GSD estimation results (YOLO11l-OBB, 306 images). (a) Predicted vs. ground-truth GSD for images with GSD $< 0.55$ m/px (271 of 306 images). The shaded band denotes $\pm 10\%$ error; the horizontal dotted line at $\mathrm{GSD}_{\mathrm{pred}} \approx 0.31$ m/px marks the resolution guard limit ($P_{\mathrm{mode}} \approx 16$ px), beyond which predictions saturate. (b) Cumulative error distribution: 66.0% of images achieve $< 10\%$ error (green dashed) and 83.3% achieve $< 20\%$ error (amber dashed). The long tail corresponds primarily to low-resolution images (GSD $> 0.3$ m/px).
  • Figure 4: Qualitative comparison on three benchmark entries. Top: annotated input with GPT-4o predictions (red). Bottom: RS-GSD pipeline with vehicle OBBs (green) and SAM segmentation overlay. Our geometric approach achieves 0.7--4.3% error vs. GPT-4o's 50--58% error, illustrating the spatial scale hallucination phenomenon.
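
The numbers quoted in the captions can be related with some simple arithmetic. The sketch below is illustrative only and rests on assumptions: the helper names are hypothetical, the 100,000-pixel mask is an invented example, and the reference length implied by the resolution guard is our own inference from the Figure 3 caption, not a value stated in the source. It shows how a segmented pixel count converts to area, how a relative GSD error propagates to area (area scales with the square of GSD), and what reference length is consistent with predictions saturating at about 0.31 m/px when the modal length is about 16 px.

```python
def area_m2(mask_pixel_count, gsd_m_per_px):
    """Ground area of a segmentation mask: each pixel covers GSD^2 square metres."""
    return mask_pixel_count * gsd_m_per_px ** 2


def area_error_from_gsd_error(gsd_rel_error):
    """Relative area error caused by a relative GSD error (area ~ GSD^2)."""
    return (1.0 + gsd_rel_error) ** 2 - 1.0


def implied_reference_length_m(gsd_m_per_px, modal_length_px):
    """Invert GSD = L_ref / P_mode to recover the reference length L_ref."""
    return gsd_m_per_px * modal_length_px


# A hypothetical 100,000-pixel mask at the Figure 1 GSD of 0.241 m/px.
example_area = area_m2(100_000, 0.241)

# A 6.87% GSD overestimate alone inflates area by roughly 14%.
scale_only_area_error = area_error_from_gsd_error(0.0687)

# Saturation at ~0.31 m/px with P_mode ~ 16 px implies L_ref of roughly 5 m.
l_ref = implied_reference_length_m(0.31, 16)
```

The squared dependence suggests why the end-to-end area error (19.7% median) is larger than the GSD error (6.87% median): the scale error is amplified, and segmentation error adds to it.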