UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Baichuan Zhou; Haote Yang; Dairong Chen; Junyan Ye; Tianyi Bai; Jinhua Yu; Songyang Zhang; Dahua Lin; Conghui He; Weijia Li

TL;DR

UrBench is a multi-view urban benchmark for rigorously evaluating Large Multimodal Models (LMMs) across 14 task types in four dimensions, incorporating cross-view data to test cross-perspective reasoning. It introduces a cross-view detection-matching pipeline for instance-level annotations and combines model-, rule-, and human-based question generation to build 11.6K high-quality questions. Evaluations on 21 LMMs show substantial gaps to human performance, especially in cross-view tasks: even the best-performing model, GPT-4o, trails humans by an average of 17.4%, and models behave inconsistently across views. The work highlights the need for multi-view pretraining and urban-centric data to improve LMMs' capabilities in real-world city scenarios, offering a benchmark to guide future research and development.

Abstract

Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, but only a few benchmarks focus specifically on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs on basic region-level urban tasks under a single view, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both the region level and the role level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale, high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several respects. Even the best-performing model, GPT-4o, lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization, and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs behave inconsistently across different urban views, especially with respect to understanding cross-view relations.

Paper Structure (14 sections, 6 figures, 1 table)

This paper contains 14 sections, 6 figures, 1 table.

Introduction
Related Work
Large Multimodal Models
Multimodal Benchmarks
UrBench
Benchmark Analysis
Benchmark Tasks
Benchmark Curation
Experiments
Evaluation Setups
Main Results
Detailed Analysis
Conclusion
Acknowledgments

Figures (6)

Figure 1: Comparison between UrBench and previous works. (1) UrBench contains both region-level and role-level questions, while previous benchmarks generally focus on region-level questions. (2) In addition to single-view questions in satellite or street view, UrBench also incorporates cross-view questions. (3) It evaluates LMMs on a comprehensive range of 14 diverse tasks in 4 evaluation dimensions.
Figure 2: Performance of the 5 leading LMMs, together with human and random-guess baselines, on UrBench.
Figure 3: (a) The 14 types of tasks under 4 evaluation dimensions. (b) The view types of each task. (c) The statistics of UrBench. cross, sat, and str are the abbreviations for cross-view, satellite-view, and street-view. mono, pano, and multi are the abbreviations for monocular, panoramic, and multiple. MC and open mean multiple-choice and open-ended, respectively.
Figure 4: UrBench consists of 14 different task types, categorized into four evaluation dimensions based on the capacities and the granularity of the objects of interest assessed by the questions.
Figure 5: The UrBench curation pipeline includes data collection, data pre-processing, question generation, and quality control.
...and 1 more figure
