Scenario Understanding of Traffic Scenes Through Large Visual Language Models

Esteban Rivera, Jannik Lübberstedt, Nico Uhlemann, Markus Lienkamp

TL;DR

This work tackles the domain generalization challenge in autonomous driving by using Large Visual Language Models (LVLMs) to automatically categorize traffic scenes into meaningful tags. It evaluates a diverse set of LVLMs (e.g., GPT-4 Vision, LLaVA, Composer-HD, CogVLM) on an in-house German dataset and the public BDD100K, employing a scalable captioning/prompting pipeline and measuring performance with accuracy and macro F1-score. The study finds LVLMs can robustly handle several detection and reasoning tasks, outperforming traditional CNN baselines in F1 on some categories and offering a data-efficient path to cross-domain scene understanding, while noting limitations in single-frame perception and certain ambiguous categories. It also highlights a practical inference-time analysis, identifying trade-offs between model quality and processing speed, and suggests future work integrating temporal context and multimodal data to further enhance performance in real-world autonomous-driving pipelines.
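The two metrics named above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the weather tags used here are hypothetical, chosen only to show how accuracy and macro F1 differ when one tag is rare. Macro F1 averages the per-tag F1 scores with equal weight, so a tag that is never predicted correctly pulls the score down even if it appears only once.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted tag equals the reference tag."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-tag F1 over all tags seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the tag never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical weather tags for four frames (illustrative data only).
reference = ["sunny", "sunny", "rainy", "snowy"]
predicted = ["sunny", "rainy", "rainy", "sunny"]

print(f"accuracy = {accuracy(reference, predicted):.3f}")
print(f"macro F1 = {macro_f1(reference, predicted):.3f}")
```

In this toy case half of the frames are tagged correctly, yet macro F1 is lower than accuracy because the single "snowy" frame is missed entirely and its F1 of zero counts as much as the other two tags.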

Abstract

Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve their high performance. However, their generalization often suffers due to domain-specific data distributions, making an effective scene-based categorization of samples necessary to improve their reliability across diverse domains. Manual captioning, though valuable, is both labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and BDD100K. We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling flexible deployment on new datasets. Our analysis, combining quantitative metrics with qualitative insights, demonstrates the effectiveness of LVLMs in understanding urban traffic scenarios and highlights their potential as an efficient tool for data-driven advancements in autonomous driving.

Paper Structure

This paper contains 24 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Example for a common traffic scenario contained in our dataset, depicting the interaction of the ego vehicle with a group of pedestrians crossing over a crosswalk on a sunny day and a silver car parked on the sidewalk towards the left.
  • Figure 2: Workflow of our three-step prompt engineering approach based on the example of vision-impairing brightness.
  • Figure 3: Accuracy and F1-score for the best model of each architecture.
  • Figure 4: Examples of false positives in the traffic light category. Here, Composer-HD mistakes illuminated traffic signs (left) or red rear lights of vehicles (right) for actual traffic lights.
  • Figure 5: Example scenarios where the actual weather condition is difficult to determine.
  • ...and 4 more figures