DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler, Anton Obukhov

TL;DR

DGInStyle leverages large latent diffusion models to generate semantically labeled street-scene data for domain-generalized semantic segmentation in autonomous driving. It introduces Style Swap to separate semantic control from domain style, Style Prompting to diversify styles, and Multi-Resolution Latent Fusion to produce high-resolution details while preserving large objects. Empirical results show consistent, substantial gains across multiple DG baselines and backbones, achieving state-of-the-art performance on five target datasets. This work demonstrates the viability of diffusion-based data augmentation as a scalable path toward domain-robust dense scene understanding.

Abstract

Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Third, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods compared to the previous state-of-the-art methods. The source code and the generated dataset are available at https://dginstyle.github.io.

Paper Structure (19 sections, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Crossing domain boundaries with DGInStyle. We propose a data-centric generative pipeline for domain generalization. It is derived from Stable Diffusion and augmented with a novel high-precision style-preserving semantic control. DGInStyle combines semantic masks (Query) with style prompts (e.g., Night or Rain) to generate training data for semantic segmentation networks with widely varying appearance. It achieves state-of-the-art semantic segmentation across domains in autonomous driving.
  • Figure 2: ControlNet learns the source domain style. This effect hinders varied data generation for domain generalization. Our Style Swap mitigates the effect and preserves the style prior.
  • Figure 3: Style variations. DGInStyle can generate images under various scene conditions through style prompting, while maintaining consistent dense semantic control from (a).
  • Figure 4: Overview of our proposed Style Swap technique. ControlNet learns segmentation-conditioned image generation on the source domain. To avoid ControlNet steering the generated style, it is trained on top of a source domain fine-tuned LDM. Later, this source domain LDM can be replaced with the original LDM to restore the rich style prior. As discussed in the experiments section, this technique leads to state-of-the-art results in domain generalization for semantic segmentation.
  • Figure 5: MRLF module. We generate a first-pass image $I$ using low-resolution conditioning. In the subsequent high-resolution pass, we partition the canvas into overlapping tiles at each generation step, concurrently apply denoising to each with its respective conditioning, and fuse them with a tile diffusion technique. Finally, we preserve the quality of large objects in the mask $\mathrm{M}$ with inpainting conditioned on the first pass image. The color gradient represents the path from noise to clean data.
  • ...and 9 more figures
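
The tile-fusion step described in the Figure 5 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the caption only says that overlapping tiles are denoised and fused, so the uniform averaging of overlaps used here is an assumption, and the names `tile_starts`, `fuse_tiles`, and `denoise_tile` are hypothetical. The callback `denoise_tile` stands in for one denoising step of the LDM on a single tile with its own conditioning. The inpainting of large objects in the mask $\mathrm{M}$ is not covered.

```python
import numpy as np


def tile_starts(size, tile, stride):
    """Start offsets of overlapping tiles that cover the range [0, size)."""
    if tile >= size:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        # Add a final tile flush with the border so no pixels are left out.
        starts.append(size - tile)
    return starts


def fuse_tiles(latent, denoise_tile, tile=64, stride=32):
    """Run denoise_tile on every overlapping tile and average the overlaps.

    latent:       array of shape (channels, height, width)
    denoise_tile: hypothetical callback (patch, y, x) -> array shaped like patch
    Assumes stride <= tile, so neighbouring tiles touch or overlap.
    """
    _, height, width = latent.shape
    acc = np.zeros(latent.shape, dtype=np.float64)       # sum of tile outputs
    weight = np.zeros((1, height, width), dtype=np.float64)  # tiles per pixel
    for y in tile_starts(height, tile, stride):
        for x in tile_starts(width, tile, stride):
            patch = latent[:, y:y + tile, x:x + tile]
            acc[:, y:y + tile, x:x + tile] += denoise_tile(patch, y, x)
            weight[:, y:y + tile, x:x + tile] += 1.0
    # Every pixel is covered by at least one tile, so weight is never zero.
    return acc / weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    canvas = rng.normal(size=(4, 96, 128))
    # An identity "denoiser" must leave the canvas unchanged after fusion.
    fused = fuse_tiles(canvas, lambda patch, y, x: patch)
    print(np.allclose(fused, canvas))
```

In a full sampler this fusion would run once per generation step, with each tile receiving the crop of the semantic mask at its own offset `(y, x)`. Uniform averaging is the simplest way to reconcile overlapping tiles; a weighted blend that favours tile centres is a common alternative.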