SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models

Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, Sandeep Chinchali

TL;DR

SynDiff-AD targets the pervasive problem of under-represented driving conditions in semantic segmentation and end-to-end autonomous driving by generating semantically consistent, subgroup-specific synthetic data. It combines latent diffusion models with ControlNet conditioned on semantic maps and a novel caption generation (CaG) prompting scheme to produce images that balance datasets across weather and lighting subgroups, without additional labeling. Empirical results on Waymo, DeepDrive, and CARLA show improvements in segmentation (up to 2.3 mIoU points) and in driving performance (up to roughly 20% in driving score, DS) across diverse conditions, with ablations confirming that CaG improves synthetic data quality. The approach reduces labeling costs and improves model robustness, though it is limited to single-view data and does not explore adversarial or multi-view data generation for further gains.

Abstract

In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as "Clear and Day" weather, leading to decreased performance in under-represented conditions like "Rainy and Night". To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet, a DM that guides data generation conditioned on semantic maps, along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model. We release our code and pipeline at https://github.com/UTAustin-SwarmLab/SynDiff-AD.
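
To make the idea of augmenting under-represented subgroups concrete, the following is a minimal sketch of how a per-subgroup generation budget could be computed. It is not the authors' procedure: the function name `synthetic_budget`, the `target_fraction` parameter, the rule of topping each subgroup up to a fraction of the largest one, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def synthetic_budget(subgroup_labels, target_fraction=1.0):
    """Return how many synthetic images to generate for each subgroup.

    Assumption (not from the paper): every subgroup is topped up to
    target_fraction * (size of the largest subgroup).
    """
    counts = Counter(subgroup_labels)
    if not counts:
        return {}
    target = int(target_fraction * max(counts.values()))
    # Subgroups already at or above the target need no synthetic images.
    return {group: max(0, target - n) for group, n in counts.items()}


# Toy dataset: one subgroup label per image (hypothetical values).
labels = ["Clear, Day"] * 6 + ["Rainy, Night"] * 1 + ["Cloudy, Day"] * 3
budget = synthetic_budget(labels)
print(budget)
```

With `target_fraction=1.0`, the dominant subgroup receives no synthetic images and the rarer ones are filled up to its size; a smaller fraction gives a milder rebalancing.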

Paper Structure

This paper contains 15 sections, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Our proposed synthetic data augmentation method, SynDiff-AD. We begin by identifying subgroups of images in the dataset using a pre-trained vision-language model like CLIP (Radford et al., 2021) (Step 1). Then, we create captions with our proposed caption generation scheme to synthesize images using a controlled latent diffusion model for under-represented subgroups (Step 2). Finally, the generated synthetic data is used to augment the original dataset to fine-tune task-specific models (Step 3).
  • Figure 2: Caption Generation Pipeline. Our proposed caption generation scheme is used to fine-tune ControlNet and subsequently generate synthetic data. We query LLaVA with a prompt constructed from the semantic mask to obtain a caption for its corresponding image. A subgroup, either associated with the image or a target under-represented condition, is appended to this caption for fine-tuning ControlNet or image synthesis, respectively.
  • Figure 3: Ablation of synthesized images on the Waymo dataset. We qualitatively visualize synthetic images for "Rainy and Night" weather for an image from the Waymo dataset taken in "Clear and Day" conditions. We generate images from a base ControlNet model and a fine-tuned ControlNet model, with text prompts obtained with and without our subgroup-specific caption generation scheme (CaG). First, as seen from images in columns 1 and 2, the base ControlNet model yields synthetic images that are hyper-realistic and not dataset-specific. The fine-tuned ControlNet model generates more dataset-specific images as observed by artifacts such as the blurring due to raindrops in rainy weather and realistic backgrounds. Additionally, we show that CaG further improves the quality of synthesized images. For instance, images generated without CaG using a fine-tuned ControlNet in rows 2 and 3 show unnatural generative artifacts on cars with missing pavements and lane markers. These artifacts are absent in the synthesized image via CaG.
  • Figure 4: The distribution of identified subgroups in the original and augmented dataset for segmentation experiments on the Waymo dataset.
  • Figure 5: The distribution of the identified subgroups in the original and augmented dataset for segmentation experiments on the DeepDrive dataset.
  • ...and 4 more figures
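
The caption flow described for Figure 2 can be sketched as plain string handling. This is an illustrative sketch only: the helper names `build_vlm_query` and `add_subgroup`, the prompt wording, and the example caption are hypothetical, and the call to LLaVA is replaced by a hard-coded string rather than a real model query.

```python
def build_vlm_query(class_names):
    """Build a text query from the classes present in a semantic mask.

    The wording is a hypothetical stand-in for the prompt sent to LLaVA.
    """
    objects = ", ".join(sorted(set(class_names)))
    return f"Describe the image, focusing on the following objects: {objects}."


def add_subgroup(caption, subgroup):
    """Append a subgroup description to a caption (hypothetical phrasing)."""
    base = caption.rstrip().rstrip(".")
    return f"{base}. Image taken in {subgroup} conditions."


# Classes found in the semantic mask of one image (toy example).
query = build_vlm_query(["road", "car", "car", "pedestrian"])

# Stand-in for the caption a vision-language model would return for `query`.
caption = "A street with parked cars and a pedestrian on the sidewalk."

# Fine-tuning ControlNet: append the subgroup the image actually belongs to.
finetune_prompt = add_subgroup(caption, "Clear and Day")

# Image synthesis: append the target under-represented subgroup instead.
synthesis_prompt = add_subgroup(caption, "Rainy and Night")

print(query)
print(finetune_prompt)
print(synthesis_prompt)
```

The same caption serves both purposes; only the appended subgroup changes, which is what lets a "Clear and Day" semantic layout be re-rendered under a different condition.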