TDBench: A Benchmark for Top-Down Image Understanding with Reliability Analysis of Vision-Language Models

Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan, Xiaofan Jiang

TL;DR

This work introduces TDBench, a top-down image understanding benchmark with 2000 curated questions per rotation for evaluating vision-language models (VLMs) in aerial contexts. It proposes RotationalEval (RE), which requires consistent answers across four rotations $\{0^\circ,90^\circ,180^\circ,270^\circ\}$ of the same scene, exploiting the rotational invariance of top-down imagery to reduce luck-driven successes. Beyond accuracy, the authors formalize a reliability analysis with three mixture parameters: $\theta$ (the fraction of questions the model truly knows), $r$ (accuracy on known questions), and $g$ (accuracy on guessed questions). From these they compute an Adjusted Accuracy $A_{\text{adj}}=\theta\,r$, which separates grounded knowledge from guessing. Four real-world case studies cover digital zoom, altitude effects, partial visibility, and depth perception, illustrating how TDBench can guide deployment of robust, trustworthy VLMs in safety-critical aerial applications. The results highlight both capabilities and gaps in current VLMs and position reliability-aware evaluation as a crucial complement to conventional accuracy metrics.

Abstract

Top-down images play an important role in safety-critical settings such as autonomous navigation and aerial surveillance, where they provide holistic spatial information that front-view images cannot capture. Despite this, Vision Language Models (VLMs) are mostly trained and evaluated on front-view benchmarks, leaving their performance in the top-down setting poorly understood. Existing evaluations also overlook a unique property of top-down images: their physical meaning is preserved under rotation. In addition, conventional accuracy metrics can be misleading, since they are often inflated by hallucinations or "lucky guesses", which obscures a model's true reliability and its grounding in visual evidence. To address these issues, we introduce TDBench, a benchmark for top-down image understanding that includes 2000 curated questions for each rotation. We further propose RotationalEval (RE), which measures whether models provide consistent answers across four rotated views of the same scene, and we develop a reliability framework that separates genuine knowledge from chance. Finally, we conduct four case studies targeting underexplored real-world challenges. By combining rigorous evaluation with reliability metrics, TDBench not only benchmarks VLMs in top-down perception but also provides a new perspective on trustworthiness, guiding the development of more robust and grounded AI systems. Project homepage: https://github.com/Columbia-ICSL/TDBench
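The RotationalEval idea described above can be illustrated with a minimal sketch. It assumes that "consistent" means the model answers correctly on all four rotated views of a question, which matches the failure case in the Figure 3 caption but is not confirmed by this page. The record layout and the function name `rotational_eval` are hypothetical and are not taken from the TDBench code base.

```python
from collections import defaultdict

# The four views used by RotationalEval, in degrees.
ROTATIONS = (0, 90, 180, 270)


def rotational_eval(records):
    """Compare per-view accuracy with an all-rotations-correct score.

    `records` is an iterable of (question_id, rotation_deg, is_correct)
    tuples. This layout is an assumption made for illustration only.
    """
    by_question = defaultdict(dict)
    for question_id, rotation, is_correct in records:
        by_question[question_id][rotation] = bool(is_correct)

    if not by_question:
        raise ValueError("no records supplied")

    # Per-view accuracy: every rotated view counts as its own question.
    total_views = sum(len(views) for views in by_question.values())
    correct_views = sum(sum(views.values()) for views in by_question.values())

    # RE accuracy (assumed rule): a question counts only if all four
    # rotations were answered correctly; a missing view counts as a miss.
    re_correct = sum(
        all(views.get(rotation, False) for rotation in ROTATIONS)
        for views in by_question.values()
    )

    return {
        "per_view_accuracy": correct_views / total_views,
        "re_accuracy": re_correct / len(by_question),
    }


# Toy data: q1 is right on every view, q2 fails only at 180 degrees.
example_records = [
    ("q1", 0, True), ("q1", 90, True), ("q1", 180, True), ("q1", 270, True),
    ("q2", 0, True), ("q2", 90, True), ("q2", 180, False), ("q2", 270, True),
]
example_scores = rotational_eval(example_records)
```

In the toy data, seven of eight views are correct, yet only one of the two questions survives the all-rotations rule, which shows how RE penalizes answers that do not hold up under rotation.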

Paper Structure

This paper contains 53 sections, 16 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: (Left) Accuracy across ten top-down image tasks in TDBench. (Right) Knowledge decomposition analysis from TDBench: % of questions known ($\theta$) measures the proportion of questions a model truly knows; $P(\text{Correct}\mid\text{Known})$ ($r$) is the model's accuracy among the questions that it knows; $P(\text{Correct}\mid\text{Guessed})$ ($g$) is the model's accuracy among the questions it does not know; and the Adjusted Accuracy ($A_{\text{adj}}=\theta\cdot r$) is the model's accuracy without lucky guesses.
  • Figure 2: Benchmark examples across the ten categories in TDBench. Different colors indicate the three high-level capability groups: image perception (blue), single-instance understanding (green), and multi-instance reasoning (yellow). 'GT' refers to ground truth.
  • Figure 3: Proposed RotationalEval (RE) strategy. In RE, each image is rotated three times to create four questions, with choices generated separately for each rotation. We illustrate a failure case in object localization where four choices align with four images, and the VLM answers three correctly but fails on one.
  • Figure 4: Average RE performance of models on TDBench, aggregated across 10 evaluation dimensions for both Open-source and Proprietary models.
  • Figure 5: Impact of digital magnification on aerial object detection performance.
  • ...and 6 more figures
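
The knowledge decomposition in the Figure 1 caption can be made concrete with a small sketch. It assumes every question is either known or guessed, so that the law of total probability gives an expected raw accuracy of $\theta\,r + (1-\theta)\,g$. That raw-accuracy expression is an inference from the caption's definitions rather than a formula quoted from the paper, and the function names are hypothetical. How the paper estimates $\theta$, $r$, and $g$ is not described on this page, so the sketch takes them as given.

```python
def _check_probability(name, value):
    """Reject values outside [0, 1]; all three parameters are probabilities."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must lie in [0, 1], got {value}")


def adjusted_accuracy(theta, r):
    """A_adj = theta * r: accuracy attributable to known questions only."""
    _check_probability("theta", theta)
    _check_probability("r", r)
    return theta * r


def expected_raw_accuracy(theta, r, g):
    """Expected raw accuracy under the assumed known/guessed mixture.

    The first term is the adjusted accuracy; the second term is the share
    of correct answers that come from guessing.
    """
    _check_probability("g", g)
    return adjusted_accuracy(theta, r) + (1.0 - theta) * g


# Illustrative values only, not results from the paper.
theta, r, g = 0.6, 0.9, 0.25
a_adj = adjusted_accuracy(theta, r)
a_raw = expected_raw_accuracy(theta, r, g)
lucky_share = a_raw - a_adj  # portion of raw accuracy due to guessing
```

With these illustrative values the raw accuracy exceeds the adjusted accuracy by $(1-\theta)\,g$, which is the gap that reporting $A_{\text{adj}}$ is meant to expose.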