ChatBEV: A Visual Language Model that Understands BEV Maps

Qingyao Xu, Siheng Chen, Guang Chen, Yanfeng Wang, Ya Zhang

TL;DR

ChatBEV tackles bird's-eye-view (BEV) map understanding for traffic scenes by introducing ChatBEV-QA, a large BEV-focused visual question answering (VQA) benchmark, and ChatBEV, a vision-language model fine-tuned on it. ChatBEV-QA is built with an automated data construction pipeline and covers six BEV-centric question types, supporting holistic reasoning about global scene context, vehicle-lane interactions, and vehicle-vehicle relationships. The work further integrates map understanding into a diffusion-based, language-driven scene generation pipeline, enabling text-guided, realistic trajectory generation. Together, these contributions advance BEV-aware reasoning and controllable traffic scenario synthesis with practical implications for autonomous driving and SIM/OTC simulations.

Abstract

Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advances in vision-language models (VLMs) have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using BEV maps, remains underexplored. Existing methods often suffer from limited task design and limited data volume, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model, ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code, and fine-tuned model will be released.

Paper Structure

This paper contains 17 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: We propose ChatBEV-QA, a scalable BEV VQA benchmark that encompasses comprehensive scene understanding tasks. Based on ChatBEV-QA, our fine-tuned ChatBEV model excels in scene understanding tasks and provides high-level guidance for subsequent applications like scene generation.
  • Figure 2: Illustration of the dataset construction pipeline and statistics.
  • Figure 3: The inference pipeline of our language-driven scene generation model.
  • Figure 4: Visualization results. Map understanding information helps to enhance generation accuracy and corner case handling.