Embodied Scene Understanding for Vision Language Models via MetaVQA
Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, Bolei Zhou
TL;DR
MetaVQA addresses the lack of a standardized benchmark for embodied scene understanding in Vision Language Models (VLMs). It introduces a large-scale VQA and closed-loop driving benchmark built from real-world nuScenes and Waymo data and from simulated trajectories in MetaDrive. It uses Set-of-Mark prompting so that questions can refer to objects unambiguously, and it provides both open-loop and interactive (closed-loop) evaluation pipelines, with 30 question types spanning spatial and embodied reasoning. The dataset comprises over 4 million questions across real and simulated scenes. Reported results include strong zero-shot grounding and sim-to-real transfer; fine-tuned VLMs also show improved spatial reasoning and safety-aware driving maneuvers in simulation. Code and data are slated for public release.
Abstract
Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from the nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracy but also in emerging safety-aware driving maneuvers. In addition, the learned capabilities transfer well from simulation to real-world observations. Code and data will be publicly available at https://metadriverse.github.io/metavqa.
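To make the idea of generating question-answer pairs from top-down annotations concrete, here is a minimal Python sketch. It is purely illustrative and not the MetaVQA pipeline: the `MarkedObject` type, the ego-frame convention (x forward, y left), the four-sector split, and the question wording are all assumptions introduced for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class MarkedObject:
    """Hypothetical annotated object; field names are assumptions."""
    mark: int   # numeric label overlaid on the image (Set-of-Mark style)
    x: float    # metres ahead (+) or behind (-) of the ego vehicle
    y: float    # metres to the left (+) or right (-) of the ego vehicle


OPTIONS = ["front", "left", "back", "right"]


def sector(obj: MarkedObject) -> str:
    """Bucket the object's bearing into one of four 90-degree sectors."""
    angle = math.degrees(math.atan2(obj.y, obj.x))  # 0 = ahead, +90 = left
    if -45 <= angle <= 45:
        return "front"
    if 45 < angle < 135:
        return "left"
    if -135 < angle < -45:
        return "right"
    return "back"


def make_question(obj: MarkedObject) -> dict:
    """Build one multiple-choice question whose answer comes from ground truth."""
    letters = "ABCD"
    choices = "; ".join(f"({l}) {o}" for l, o in zip(letters, OPTIONS))
    text = (
        f"Relative to the ego vehicle, where is object <{obj.mark}>? "
        f"Choose one: {choices}."
    )
    return {"question": text, "answer": letters[OPTIONS.index(sector(obj))]}


if __name__ == "__main__":
    print(make_question(MarkedObject(mark=3, x=10.0, y=0.5)))
```

Because the answer is computed from the annotation rather than written by hand, a template like this can be applied to every labeled object in every scene, which is what makes large question counts feasible.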
