DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving

Hongbin Lin, Yiming Yang, Chaoda Zheng, Yifan Zhang, Shuaicheng Niu, Zilu Guo, Yafeng Li, Gui Gui, Shuguang Cui, Zhen Li

TL;DR

DriveFlow tackles the robustness gap in vision-centric 3D object detection under distribution shifts by proposing training-free data augmentation via rectified flow adaptation of pre-trained Text-to-Image flow models. It introduces two frequency-based strategies (foreground high-frequency preservation and background dual-frequency optimization), coupled with a set of losses that preserve 3D geometry while enabling rich background edits. Empirical results across monocular, multi-view, and temporal detectors on KITTI-C, nuScenes-C, and real-world transfers show consistent improvements and efficiency gains over inversion-based methods. The approach offers a practical, scalable way to enhance autonomous driving perception in diverse, real-world conditions.

Abstract

In autonomous driving, vision-centric 3D object detection recognizes and localizes 3D objects from RGB images. However, due to high annotation costs and diverse outdoor scenes, training data often fails to cover all possible test scenarios, a problem known as the out-of-distribution (OOD) issue. Training-free image editing offers a promising way to improve model robustness through training data enhancement, without any modification to pre-trained diffusion models. Nevertheless, inversion-based methods often suffer from limited effectiveness and inherent inaccuracies, while recent rectified-flow-based approaches struggle to preserve objects with accurate 3D geometry. In this paper, we propose DriveFlow, a Rectified Flow Adaptation method for training data enhancement in autonomous driving based on pre-trained Text-to-Image flow models. Based on frequency decomposition, DriveFlow introduces two strategies to adapt noise-free editing paths derived from text-conditioned velocities. 1) High-Frequency Foreground Preservation: DriveFlow incorporates a high-frequency alignment loss for the foreground to maintain precise 3D object geometry. 2) Dual-Frequency Background Optimization: DriveFlow also conducts dual-frequency optimization for the background, balancing editing flexibility and semantic consistency. Extensive experiments validate the effectiveness and efficiency of DriveFlow, demonstrating performance improvements on all categories across OOD scenarios. Code is available at https://github.com/Hongbin98/DriveFlow.

Paper Structure

This paper contains 19 sections, 14 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison on KITTI-C based on MonoFlex. DriveFlow achieves 1) better performance with only Snow augmentation (orange) than DriveGEN with 6 augmentations (purple), and 2) comprehensive gains on the minority class (Pedestrian) across OOD scenarios. Better viewed in color.
  • Figure 2: An illustration of DriveFlow for training data enhancement in vision-centric 3D object detection. In contrast to the inversion-based approach DriveGEN, DriveFlow conducts rectified flow adaptation based on pre-trained T2I flow models (e.g., Stable Diffusion 3), thereby achieving comprehensive improvement and rapid generation for 3D detectors.
  • Figure 3: Due to the lack of foreground constraints, FlowEdit [kulikov2024flowedit] often fails to maintain 3D objects even with text descriptions from Qwen2.5-VL [bai2025qwen2], while DriveFlow only requires the target scene conditions and image layouts (i.e., 2D bounding boxes). Note that foreground preservation enables annotation reuse for augmented training.
  • Figure 4: An illustration of DriveFlow. Without modifying the pre-trained model, DriveFlow employs frequency-based decomposition for both velocity fields $V_t^{src}$ and $V_t^{tar}$, and then applies: 1) High-Frequency Foreground Preservation, which uses an L2 alignment loss to explicitly align high-frequency content between the velocity fields. 2) Dual-Frequency Background Optimization, which introduces dual-frequency optimization for background areas to ensure editing flexibility and semantic consistency.
  • Figure 5: Ablation studies on the loss terms $\mathcal{L}_{obj}$, $\mathcal{L}_{div}$ and $\mathcal{L}_{bg}$. More results are available in Appendix E.
  • ...and 4 more figures
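
The Figure 4 caption describes decomposing the source and target velocity fields by frequency and aligning their high-frequency content in foreground regions with an L2 loss. The NumPy sketch below illustrates that idea only; it is not the authors' implementation. It assumes a single-channel 2D field, an ideal FFT low-pass mask with a hypothetical `cutoff` parameter, and a foreground mask rasterized from 2D bounding boxes. The function names `split_frequencies`, `boxes_to_mask`, and `high_freq_foreground_loss` are illustrative and do not come from the paper or its code.

```python
import numpy as np


def split_frequencies(field, cutoff=0.1):
    """Split a 2D array into low- and high-frequency parts.

    Assumption: an ideal circular low-pass mask in the FFT domain, with
    `cutoff` given in cycles per pixel. The paper's actual filter may differ.
    """
    h, w = field.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies in [-0.5, 0.5)
    low_pass = np.sqrt(fy ** 2 + fx ** 2) <= cutoff
    low = np.fft.ifft2(np.fft.fft2(field) * low_pass).real
    high = field - low  # the two parts sum back to the input by construction
    return low, high


def boxes_to_mask(shape, boxes):
    """Rasterize (x0, y0, x1, y1) pixel boxes into a binary foreground mask."""
    mask = np.zeros(shape, dtype=np.float64)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1.0
    return mask


def high_freq_foreground_loss(v_src, v_tar, fg_mask, cutoff=0.1):
    """Mean squared high-frequency difference over foreground pixels.

    Assumption: normalizing by the foreground pixel count is an illustrative
    choice, not something stated in the source.
    """
    _, high_src = split_frequencies(v_src, cutoff)
    _, high_tar = split_frequencies(v_tar, cutoff)
    diff = (high_tar - high_src) * fg_mask
    return float(np.sum(diff ** 2) / max(fg_mask.sum(), 1.0))
```

In this sketch, a target field that differs from the source only by a constant offset has a loss of essentially zero, because the offset lies entirely in the low-frequency band. Fine detail that changes inside a box is penalized. This mirrors the stated intent of keeping object geometry fixed while leaving coarse appearance free to change.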