DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Hongbin Lin, Zilu Guo, Yifan Zhang, Shuaicheng Niu, Yafeng Li, Ruimao Zhang, Shuguang Cui, Zhen Li

TL;DR

DriveGEN tackles the robustness of vision-centric 3D detectors under distribution shifts by using training-free controllable Text-to-Image diffusion to augment training data. It introduces a two-stage framework: Self-Prototype Extraction, which encodes precise object geometry via layouts and PCA on self-attention features, and Prototype-Guided Diffusion, which preserves objects through semantic-aware and shallow feature alignment during denoising. Experiments on KITTI-C and nuScenes show substantial OOD improvements with no diffusion-model training, outperforming both training-based and training-free baselines. The approach reduces data-collection costs while improving generalization across diverse weather and scenes, making diffusion-based augmentation viable for autonomous-driving perception.

Abstract

In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as the Out-of-Distribution (OOD) problem. Controllable Text-to-Image (T2I) diffusion offers a potential way to enhance training data, provided it can generate diverse OOD scenarios while keeping precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations. It consists of two stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection. The code is available at https://github.com/Hongbin98/DriveGEN.
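
The Self-Prototype Extraction stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes self-attention features are already available as an `(H, W, C)` array, that each layout is an axis-aligned box `(x0, y0, x1, y1)` in feature-map coordinates, and that a prototype is simply the mean PCA-projected feature inside the box. The function names `pca_project`, `box_mask`, and `extract_self_prototypes` are hypothetical.

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project (N, C) features onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def box_mask(height, width, box):
    """Boolean (H, W) mask for a box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def extract_self_prototypes(feature_map, boxes, n_components=3):
    """Return one prototype per box: the mean PCA feature inside the box.

    Assumption: a plain mean over the layout region stands in for the
    paper's prototype definition, which this excerpt does not spell out.
    """
    height, width, channels = feature_map.shape
    flat = feature_map.reshape(height * width, channels)
    projected = pca_project(flat, n_components).reshape(height, width, n_components)
    prototypes = [projected[box_mask(height, width, box)].mean(axis=0) for box in boxes]
    return np.stack(prototypes)

# Toy usage: random features stand in for real self-attention features.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(16, 16, 8))
boxes = [(1, 2, 5, 6), (8, 8, 14, 12)]
prototypes = extract_self_prototypes(feature_map, boxes)
print(prototypes.shape)
```

Restricting the average to the layout region is what gives each object its own descriptor; without the mask, PCA components computed over the whole image would mix object and background features.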

Paper Structure

This paper contains 19 sections, 6 equations, 14 figures, 9 tables, 1 algorithm.

Figures (14)

  • Figure 1: An illustration of DriveGEN for enhancing vision-centric 3D detectors. DriveGEN extends original training images to diverse Out-of-Distribution (OOD) scenarios without additional diffusion model training, preserving all annotated objects. Even with a single augmentation, i.e., 'Snow', the augmented detector [zhang2021objects] improves not only on the original KITTI [geiger2012we] val set (Ideal) and the already known 'Snow' scenario, but also in the unseen 'Fog' and 'Defocus' scenarios, demonstrating a comprehensive improvement in model generalizability.
  • Figure 2: Illustration of the importance of object preservation in 3D detection, based on KITTI. ControlNet [zhang2023adding] suffers from object orientation errors and omissions even when accurate segmentation masks [ravi2024sam] and rich prompts [chen2024internvl] are provided, showing that training-based methods may struggle with spatial control when training data is limited. Additionally, FreeControl [mo2024freecontrol] relies on coarse low-resolution features to capture semantic structures, which can lose object geometry and result in object position misalignment and omissions.
  • Figure 3: An overview of DriveGEN, which consists of two stages: 1) The Self-Prototype Extraction stage extracts accurate semantic structures of multiple objects. To capture precise locations, we obtain fine-grained self-prototypes by using layouts and the peak function to re-weight object regions rather than directly using coarse self-attention features. 2) The Prototype-Guided Diffusion stage conducts semantic-aware feature alignment for semantic structure matching and shallow feature alignment for tiny object preservation.
  • Figure 4: Based on MonoCD [yan2024monocd], we provide more comparisons with baselines on KITTI-C in terms of Mean $AP_{3D|R_{40}}$.
  • Figure 5: Ablation studies on semantic feature alignment loss $g_{sa}$, self-prototypes $\hat{\mathbf{P}}_t$ and shallow feature alignment loss $g_{sl}$.
  • ...and 9 more figures
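
The Figure 3 caption mentions re-weighting object regions with a peak function and aligning features during denoising. The sketch below illustrates that idea only. The excerpt does not define the peak function or the alignment terms $g_{sa}$ and $g_{sl}$, so the Gaussian-shaped `peak_weights` and the weighted squared distance in `weighted_alignment` are assumptions chosen for illustration, and both names are hypothetical.

```python
import numpy as np

def peak_weights(height, width, box, sharpness=4.0):
    """Weights that peak at the box centre and are zero outside the box.

    Assumption: a Gaussian bump stands in for the paper's peak function.
    """
    x0, y0, x1, y1 = box  # end-exclusive pixel coordinates
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = (x0 + x1 - 1) / 2.0, (y0 + y1 - 1) / 2.0
    # Normalise offsets by the box half-extent so the peak scales with object size.
    dx = (xs - cx) / max((x1 - x0) / 2.0, 1.0)
    dy = (ys - cy) / max((y1 - y0) / 2.0, 1.0)
    weights = np.exp(-sharpness * (dx ** 2 + dy ** 2) / 2.0)
    inside = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
    return weights * inside

def weighted_alignment(reference, generated, weights):
    """Weighted mean squared feature distance between two (H, W, C) maps."""
    per_pixel = ((reference - generated) ** 2).sum(axis=-1)
    return float((weights * per_pixel).sum() / weights.sum())

# Toy usage: perturb the generated features at the object centre.
rng = np.random.default_rng(0)
reference = rng.normal(size=(8, 8, 4))
weights = peak_weights(8, 8, box=(1, 1, 4, 4))
generated = reference.copy()
generated[2, 2] += 1.0
energy = weighted_alignment(reference, generated, weights)
print(energy > 0.0)
```

In a guidance setting, a term of this kind would be minimized with respect to the latent at each denoising step. Because the weights vanish outside the layout, changes to the background (for example, added snow or fog) incur no penalty, while deviations inside object regions do.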