Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, Seung Wook Kim, Jun Gao, Laura Leal-Taixe, Mike Chen, Sanja Fidler, Huan Ling

TL;DR

Cosmos-Drive-Dreams tackles the data bottleneck in autonomous driving by leveraging post-trained world foundation models to generate controllable, multi-view driving videos and LiDAR data. The approach combines precise layout-conditioned video generation, single-view-to-multi-view expansion, in-the-wild annotation, and weather-aware LiDAR synthesis, augmented by an LLM-driven prompt rewriter and a VLM-based rejection filter. Empirical results show consistent improvements across 3D lane detection, 3D object detection, LiDAR-based detection, and policy learning, especially in long-tail and corner-case scenarios. The work provides open-source models, datasets, and toolkits to enable scalable synthetic data generation and rapid experimentation. Overall, Cosmos-Drive-Dreams offers a practical pathway to scale synthetic data for safer, more robust autonomous driving systems.

Abstract

Collecting and annotating real-world data for safety-critical physical AI systems, such as autonomous vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain that are capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps mitigate long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning. We open-source our pipeline toolkit, dataset, and model weights through NVIDIA's Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams

Paper Structure

This paper contains 30 sections, 1 equation, 22 figures, 5 tables.

Figures (22)

  • Figure 1: Left: Autonomous Vehicle Data Flywheel enabled by Cosmos-Drive-Dreams. The cycle illustrates a continuous feedback loop for improving autonomous driving models with synthetic data generation. Right: Cosmos-Drive generates high-quality and diverse synthetic videos with multi-view and LiDAR modality support.
  • Figure 2: Overview of our Cosmos-Drive-Dreams pipeline. Starting from either structured labels or in-the-wild video, we generate a pixel-aligned HDMap condition video (Step ➊). We then use a prompt rewriter to generate diverse prompts and synthesize single-view videos (Step ➋). Each single-view video is then expanded into multiple views (Step ➌). Finally, a Vision-Language Model (VLM) filter performs rejection sampling to automatically discard low-quality samples, yielding a high-quality, diverse SDG dataset (Step ➍).
  • Figure 3: Cosmos-Drive's model suite. Top Left: We begin with a pretrained world foundation model (WFM) and post-train it on the RDS dataset to obtain driving-specific WFMs. This model is further post-trained into three models, which constitute Cosmos-Drive. Top Right: Precise layout control model (Cosmos-Transfer1-7B-Sample-AV), which generates single-view driving videos from HDMap and optional LiDAR depth videos; Bottom Left: Multi-view expansion model (Cosmos-7B-Single2Multiview-Sample-AV), which synthesizes consistent multi-view videos from a single view; Bottom Right: In-the-wild video annotation model (Cosmos-7B-Annotate-Sample-AV), which predicts HDMap and depth from in-the-wild driving videos. Right: LiDAR generation model (Cosmos-7B-LiDAR-GEN-Sample-AV), which synthesizes high-quality LiDAR points conditioned on HDMap or RGB images.
  • Figure 4: Architecture diagram of Cosmos-Transfer1-7B-Sample-AV. We adopt the DiT architecture (Peebles and Xie, 2023) for the generation model.
  • Figure 5: Precise layout control model (Cosmos-Transfer1-7B-Sample-AV) generates diverse and rare scenarios with the same HDMap but different text prompts, such as: The video captures a street scene during the day with a steady rain falling...; The scene unfolds in a chaotic environment as a fire engulfs the houses on either side of the street...; The scene is beautifully lined with blossoming sakura...
  • ...and 17 more figures
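
The four steps in the Figure 2 caption can be read as a simple generate-then-filter loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: every function name, the `Clip` type, the `quality` score, and the parameters `num_views`, `samples_per_prompt`, and `threshold` are hypothetical stand-ins, and the actual models are replaced by stubs so the control flow can be run as-is.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """Hypothetical stand-in for a generated video clip."""
    prompt: str
    views: int
    quality: float  # placeholder score in [0, 1); a real VLM filter would judge the video itself


def rewrite_prompts(base_prompt: str, conditions: List[str]) -> List[str]:
    # Step 2 (first half): stand-in for the prompt rewriter; here it only appends a condition.
    return [f"{base_prompt}, {c}" for c in conditions]


def generate_single_view(hdmap_id: str, prompt: str, rng: random.Random) -> Clip:
    # Step 2 (second half): stub for layout-conditioned single-view generation.
    # hdmap_id stands for the HDMap condition video produced in Step 1.
    return Clip(prompt=prompt, views=1, quality=rng.random())


def expand_to_multiview(clip: Clip, num_views: int) -> Clip:
    # Step 3: stub for single-view-to-multi-view expansion.
    return Clip(prompt=clip.prompt, views=num_views, quality=clip.quality)


def vlm_accepts(clip: Clip, threshold: float) -> bool:
    # Step 4: stub for the VLM rejection filter (assumed here to be a score threshold).
    return clip.quality >= threshold


def run_pipeline(hdmap_id: str, base_prompt: str, conditions: List[str],
                 num_views: int = 3, samples_per_prompt: int = 2,
                 threshold: float = 0.5, seed: int = 0) -> List[Clip]:
    rng = random.Random(seed)
    accepted = []
    for prompt in rewrite_prompts(base_prompt, conditions):
        for _ in range(samples_per_prompt):
            clip = expand_to_multiview(generate_single_view(hdmap_id, prompt, rng), num_views)
            if vlm_accepts(clip, threshold):
                accepted.append(clip)  # rejected samples are simply dropped
    return accepted


if __name__ == "__main__":
    kept = run_pipeline("hdmap_000", "A street scene", ["steady rain", "dense fog", "night"])
    for clip in kept:
        print(clip.prompt, clip.views, round(clip.quality, 3))
```

In this sketch the filter runs after multi-view expansion, following the step order in the caption. Sampling several candidates per prompt and dropping the rejected ones is one plausible reading of "rejection sampling"; the paper may implement it differently.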