DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

Yongjie Fu, Anmol Jain, Xuan Di, Xu Chen, Zhaobin Mo

TL;DR

DriveGenVLM presents a conditional denoising diffusion probabilistic model (DDPM) framework to synthesize driving videos and assesses their utility for vision-language models (VLMs) by leveraging in-context learning with EILEV. The method uses a U-Net based DDPM conditioned on observed frames to generate longer, coherent driving sequences, evaluated on the Waymo Open Dataset with multiple camera viewpoints. Evaluation via Fréchet Video Distance (FVD) shows adaptive Hierarchy-2 sampling yields the best realism and temporal coherence, and in-context narrations from EILEV demonstrate that the generated videos are interpretable by VLMs. This work highlights a practical pathway to fuse generative video modeling with VLMs to support perception, narration, and planning in autonomous driving systems.
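The conditioning idea described above (a DDPM that keeps observed frames fixed and denoises the remaining ones) can be illustrated with a minimal NumPy sketch of the standard DDPM forward noising step. The linear beta schedule, the tensor shapes, and the names `linear_beta_schedule` and `noise_latent_frames` are assumptions for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule: the linear variance schedule common in DDPM work.
    return np.linspace(beta_start, beta_end, num_steps)

def noise_latent_frames(video, observed_mask, t, alpha_bar, rng):
    """Apply q(x_t | x_0) to the latent frames only.

    video:         (frames, H, W, C) array scaled to [-1, 1]
    observed_mask: (frames,) bool array, True for conditioning frames
    Returns the partially noised video and the noise that was drawn.
    """
    eps = rng.standard_normal(video.shape)
    # Closed-form forward process: sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    noisy = np.sqrt(alpha_bar[t]) * video + np.sqrt(1.0 - alpha_bar[t]) * eps
    # Observed frames stay clean so a denoiser could condition on them.
    mask = observed_mask[:, None, None, None]
    return np.where(mask, video, noisy), eps

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

rng = np.random.default_rng(0)
video = rng.uniform(-1.0, 1.0, size=(8, 16, 16, 3))  # stand-in for real frames
observed = np.array([True] * 3 + [False] * 5)        # first 3 frames observed
x_t, eps = noise_latent_frames(video, observed, t=500, alpha_bar=alpha_bar, rng=rng)
```

In a training loop, a U-Net would typically be asked to predict `eps` for the latent frames given `x_t`, the step index, and the clean observed frames; that network is omitted here.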

Abstract

The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained on the Waymo Open Dataset and evaluated using the Fréchet Video Distance (FVD) score to assess the quality and realism of the generated videos. Corresponding narrations are provided by EILEV for these generated videos, which may be beneficial in the autonomous driving domain. These narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a significant step forward in leveraging advanced AI models to address complex challenges in autonomous driving.
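To make the FVD evaluation concrete, the following is a minimal sketch of the Fréchet distance between two sets of feature vectors, each modeled as a Gaussian. In practice FVD is computed on features from a pretrained video network; the random arrays below are placeholders for such features, and the function name `frechet_distance` is hypothetical rather than part of any evaluation toolkit.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (num_videos, feature_dim) arrays.
    d^2 = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of C_a C_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.standard_normal((200, 16))  # placeholder for real-video features
d_same = frechet_distance(real_feats, real_feats)
d_shift = frechet_distance(real_feats, real_feats + 1.0)  # mean shifted by 1 per dim
```

Lower values indicate that the generated feature distribution is closer to the real one. Shifting every feature by a constant leaves the covariance unchanged, so only the mean term contributes to `d_shift`.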

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 5 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Process of DDPM model.
  • Figure 2: Training Framework Employing U-Net with Diffusion Probabilistic Model (DDPM) Integration.
  • Figure 3: Architecture of EILEV.
  • Figure 4: Front Camera - FVD Score: 1174.
  • Figure 5: Front-left Camera - FVD Score: 812.
  • ...and 2 more figures