ChatBEV: A Visual Language Model that Understands BEV Maps

Qingyao Xu; Siheng Chen; Guang Chen; Yanfeng Wang; Ya Zhang

TL;DR

ChatBEV tackles bird's-eye-view (BEV) map understanding for traffic scenes by introducing ChatBEV-QA, a large BEV-focused visual question answering (VQA) benchmark, together with a vision-language model fine-tuned on it. The benchmark pairs an automated data construction pipeline with six BEV-centric question types that cover global scene context, vehicle-lane interactions, and vehicle-vehicle relationships. The work also integrates map understanding into a diffusion-based, language-driven scene generation pipeline, enabling text-guided generation of realistic trajectories. Together, these contributions advance BEV-aware reasoning and controllable traffic scenario synthesis, with practical implications for autonomous driving and traffic simulation.

Abstract

Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advances in vision-language models (VLMs) have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using bird's-eye-view (BEV) maps, remains underexplored. Existing methods often suffer from limited task design and small data volume, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model, ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, in which ChatBEV provides map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code, and fine-tuned model will be released.
