TDBench: A Benchmark for Top-Down Image Understanding with Reliability Analysis of Vision-Language Models

Kaiyuan Hou; Minghui Zhao; Lilin Xu; Yuang Fan; Xiaofan Jiang

TL;DR

This work introduces TDBench, a benchmark for top-down image understanding with 2000 rotation-augmented questions for evaluating vision-language models (VLMs) in aerial contexts. It proposes RotationalEval (RE), which requires consistent answers across four rotations $\{0^\circ,90^\circ,180^\circ,270^\circ\}$, enforcing rotational invariance and reducing the number of successes attributable to luck. Beyond accuracy, the authors formalize a reliability analysis that uses mixture parameters $\theta$, $r$, and $g$ to compute an Adjusted Accuracy $A_{adj}=\theta\,r$, separating grounded knowledge from guessing. Four real-world case studies cover digital zoom, altitude effects, partial visibility, and depth perception, illustrating how TDBench can guide the deployment of robust, trustworthy VLMs in safety-critical aerial applications. The results highlight both the capabilities and the gaps of current VLMs and position reliability-aware evaluation as a complement to conventional accuracy metrics.

Abstract

Top-down images play an important role in safety-critical settings such as autonomous navigation and aerial surveillance, where they provide holistic spatial information that front-view images cannot capture. Despite this, Vision Language Models (VLMs) are mostly trained and evaluated on front-view benchmarks, leaving their performance in the top-down setting poorly understood. Existing evaluations also overlook a unique property of top-down images: their physical meaning is preserved under rotation. In addition, conventional accuracy metrics can be misleading, since they are often inflated by hallucinations or "lucky guesses", which obscures a model's true reliability and its grounding in visual evidence. To address these issues, we introduce TDBench, a benchmark for top-down image understanding that includes 2000 curated questions for each rotation. We further propose RotationalEval (RE), which measures whether models provide consistent answers across four rotated views of the same scene, and we develop a reliability framework that separates genuine knowledge from chance. Finally, we conduct four case studies targeting underexplored real-world challenges. By combining rigorous evaluation with reliability metrics, TDBench not only benchmarks VLMs in top-down perception but also provides a new perspective on trustworthiness, guiding the development of more robust and grounded AI systems. Project homepage: https://github.com/Columbia-ICSL/TDBench
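The RotationalEval idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `score`, the dictionary layout, and the assumption that a question counts under RE only when the prediction matches the ground truth at all four rotations are illustrative choices made here. Ground truth is stored per rotation because some answers (for example, directional ones) may change when the image is rotated.

```python
from typing import Dict, Tuple

# The four rotations named in the text (degrees).
ROTATIONS: Tuple[int, ...] = (0, 90, 180, 270)

# Hypothetical layout: question id -> {rotation in degrees -> answer label}.
Answers = Dict[str, Dict[int, str]]


def score(preds: Answers, truth: Answers) -> Tuple[float, float]:
    """Return (per-view accuracy, RE-style accuracy).

    Per-view accuracy averages over every (question, rotation) pair.
    RE-style accuracy credits a question only if all four rotated
    views are answered correctly (an assumption of this sketch).
    """
    if not truth:
        raise ValueError("truth must contain at least one question")

    per_view_hits = 0
    all_view_hits = 0
    for qid, gt in truth.items():
        # A missing prediction yields None, which never equals a label.
        hits = [preds.get(qid, {}).get(rot) == gt[rot] for rot in ROTATIONS]
        per_view_hits += sum(hits)
        all_view_hits += all(hits)

    n = len(truth)
    return per_view_hits / (n * len(ROTATIONS)), all_view_hits / n


# Toy data: q1 is answered correctly in every view, q2 fails at 270 degrees.
truth = {
    "q1": {0: "A", 90: "A", 180: "A", 270: "A"},
    "q2": {0: "B", 90: "B", 180: "B", 270: "B"},
}
preds = {
    "q1": {0: "A", 90: "A", 180: "A", 270: "A"},
    "q2": {0: "B", 90: "B", 180: "B", 270: "C"},
}
per_view_acc, re_acc = score(preds, truth)
```

On this toy data, seven of the eight views are correct, but only one of the two questions is correct in all four views, so the all-views criterion is stricter than per-view accuracy.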
