Embodied Scene Understanding for Vision Language Models via MetaVQA

Weizhen Wang; Chenda Duan; Zhenghao Peng; Yuxin Liu; Bolei Zhou

TL;DR

MetaVQA addresses the lack of a standardized benchmark for embodied scene understanding in Vision Language Models (VLMs). It introduces a large-scale VQA and closed-loop driving benchmark grounded in real-world nuScenes and Waymo data and in trajectories simulated with MetaDrive. Set-of-Mark prompting provides clear object grounding, and the benchmark offers both open-loop and interactive evaluation pipelines covering 30 question types that span spatial and embodied reasoning. The dataset comprises over 4 million questions across real and simulated contexts, and the authors report strong zero-shot grounding. VLMs fine-tuned on MetaVQA show improved spatial reasoning and safety-aware driving maneuvers in simulation, and what they learn transfers from simulated to real-world observations. Code and data are slated for public release.

Abstract

Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from the nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observations. Code and data will be publicly available at https://metadriverse.github.io/metavqa.

Paper Structure (59 sections, 22 figures, 14 tables)

Introduction
Related Work
Constructing MetaVQA Dataset
Our Design Principles
VQA Generation Pipeline
Scenario Aggregation from Multiple Sources
Set-of-Mark Prompting
Question-Answer Generation
MetaVQA Dataset
Dataset Composition
Zero-shot Answerability with Set-of-Mark Prompting
Transfer Learning with Simulated Observations
Data Scalability of Learning
Benchmark Results
Visual Question Answering
...and 44 more sections

Figures (22)

Figure 1: Constructing the MetaVQA benchmark. We extract scene graphs from real-world traffic scenarios collected from the nuScenes and Waymo (WOMD) datasets and then feed them into question-type-dependent queries to generate ground-truth answers. The real and simulated RGB observations are processed with Set-of-Mark prompting. We evaluate the VLMs on both open-loop VQA tasks and closed-loop navigation tasks in simulation.
Figure 2: Set-of-Mark annotation process. For real-world images from the nuScenes dataset (upper row), we cast the corresponding 3D bounding boxes into 2D space. For simulated images (lower row) rendered in MetaDrive, we extract 2D bounding boxes from the simulator's instance segmentation.
Figure 3: Question-answer generation pipeline. An illustrative example for generating the identify_distance question. Note that an additional "reasoning" field is generated along with the answer to improve VLM training. This field is not used in evaluation.
Figure 4: Left: Distribution of the question types. Right: Example for each question supertype.
Figure 5: Improved embodied scene understanding after fine-tuning InternVL2-8B on the withheld training set described in the Dataset Composition section. The VLM demonstrates improved spatial understanding and embodied knowledge after training on the MetaVQA dataset. In addition, the model attains better grounding capability.
...and 17 more figures
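
The Figure 2 caption says that 3D bounding boxes are cast into 2D space for the Set-of-Mark annotation of real-world images. The sketch below shows one common way to do such a projection with a pinhole camera model. It is not the authors' implementation: the function names, the camera-frame convention (x right, y down, z forward), the axis-aligned box, and the intrinsics in the usage note are all assumptions made for illustration.

```python
from itertools import product


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box in the camera frame.

    Assumption: the box is already expressed in camera coordinates
    (x right, y down, z forward) and is not rotated.
    """
    cx, cy, cz = center
    sx, sy, sz = size  # full extents along x, y, z
    return [
        (cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
        for dx, dy, dz in product((-1, 1), repeat=3)
    ]


def project_point(point, fx, fy, u0, v0):
    """Pinhole projection of one camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + u0, fy * y / z + v0)


def box_3d_to_2d(corners, fx, fy, u0, v0, width, height, near=0.1):
    """Project 3D corners and return a 2D box (left, top, right, bottom).

    Corners at or behind the near plane are simply dropped, which is a
    simplification (a careful version would clip the box edges instead).
    Returns None when nothing of the box falls inside the image.
    """
    visible = [p for p in corners if p[2] > near]
    if not visible:
        return None
    pixels = [project_point(p, fx, fy, u0, v0) for p in visible]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    # Clamp the enclosing rectangle to the image bounds.
    left, right = max(0.0, min(us)), min(float(width), max(us))
    top, bottom = max(0.0, min(vs)), min(float(height), max(vs))
    if left >= right or top >= bottom:
        return None
    return (left, top, right, bottom)
```

As a usage example with hypothetical intrinsics (`fx = fy = 1000`, principal point `(800, 450)`, a 1600x900 image), `box_3d_to_2d(box_corners((0, 0, 10), (2, 2, 4)), 1000, 1000, 800, 450, 1600, 900)` returns `(675.0, 325.0, 925.0, 575.0)`. A numeric mark for Set-of-Mark prompting could then be drawn at or near the resulting rectangle.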
