DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving

Jiahang Tu, Wei Ji, Hanbin Zhao, Chao Zhang, Roger Zimmermann, Hui Qian

TL;DR

DriveDiTFit is a novel method for efficiently generating autonomous Driving data by Fine-tuning pre-trained Diffusion Transformers (DiTs) according to the discrepancy between the pre-trained source data and the target driving data.

Abstract

In autonomous driving, deep models have shown remarkable performance across various visual perception tasks, but they demand high-quality, highly diverse training datasets. Such datasets are expected to cover various driving scenarios with adverse weather, lighting conditions and diverse moving objects. However, collecting such data manually is challenging and expensive. Building on the rapid development of large generative models, we propose DriveDiTFit, a novel method for efficiently generating autonomous Driving data by Fine-tuning pre-trained Diffusion Transformers (DiTs). Specifically, DriveDiTFit utilizes a gap-driven modulation technique to carefully select and efficiently fine-tune a few parameters in DiTs according to the discrepancy between the pre-trained source data and the target driving data. Additionally, DriveDiTFit develops an effective weather and lighting condition embedding module to ensure diversity in the generated data, which is initialized by a nearest-semantic-similarity initialization approach. By adopting a progressive tuning scheme that refines detail generation in the early diffusion process and by enlarging the weights corresponding to small objects in the training loss, DriveDiTFit ensures high-quality generation of small moving objects in the generated data. Extensive experiments conducted on driving datasets confirm that our method can efficiently produce diverse, realistic driving data. The source code will be available at https://github.com/TtuHamg/DriveDiTFit.
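
The idea of enlarging the loss weights for small objects can be illustrated with a mask-weighted mean squared error. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `box_mask` and `weighted_mse`, the box format `(x0, y0, x1, y1)`, the default `object_weight`, and the normalization by the sum of weights are all hypothetical choices made for illustration.

```python
def box_mask(height, width, boxes):
    """Return a height x width grid holding 1.0 inside any box, else 0.0.

    Boxes are (x0, y0, x1, y1) with exclusive upper bounds (assumed format).
    """
    mask = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1.0
    return mask


def weighted_mse(pred, target, mask, object_weight=2.0):
    """Squared error averaged with weight `object_weight` inside the mask
    and weight 1.0 elsewhere (assumed weighting, normalized by total weight)."""
    total, norm = 0.0, 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = 1.0 + (object_weight - 1.0) * m
            total += w * (p - t) ** 2
            norm += w
    return total / norm


# Toy example: a 4x4 "image" whose only error lies inside a 1x1 object box.
pred = [[0.0] * 4 for _ in range(4)]
target = [[0.0] * 4 for _ in range(4)]
target[1][1] = 1.0
mask = box_mask(4, 4, [(1, 1, 2, 2)])

plain = weighted_mse(pred, target, mask, object_weight=1.0)
boosted = weighted_mse(pred, target, mask, object_weight=2.0)
```

In this toy case the error sits entirely inside the box, so `boosted` is larger than `plain`: mistakes on the small object contribute more to the loss than they would under a uniform average.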

Paper Structure (16 sections, 11 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: There is an apparent discrepancy between pre-trained datasets and driving scenario datasets. Pre-trained datasets usually feature certain categories of objects prominently displayed within the images, similar to fine-grained classification datasets such as CUB-200-2011 and Oxford Flowers. However, driving scenario datasets are more complex and contain multiple objects, including roads, vehicles and buildings, with diverse weather and lighting conditions.
  • Figure 2: Object information can be generated sufficiently in the denoising process when it is lost slowly in the diffusion process. The conventional noise schedule makes large objects in classification datasets fade slowly (top row, clock, t from 0 to 400), whereas it causes smaller objects in driving data to fade more rapidly (bottom row, vehicles, t from 0 to 200). An appropriate noise schedule is therefore necessary for driving data generation.
  • Figure 3: Our framework for diverse driving scenario generation consists of three key components: i) Gap-driven modulation techniques on the condition MLP and attention blocks (Sec. 3.2); ii) Accelerating convergence and enhancing quality by initializing with high-semantic-similarity embeddings via a CLIP encoder (Sec. 3.3); iii) Adopting a progressive tuning scheme with the novel Scos noise schedule (Sec. 3.4.1) and applying vehicle bounding box masks on the training loss (Sec. 3.4.2) for precise object representation.
  • Figure 4: $\beta_t$ over the diffusion process for the cosine noise schedule proposed by Nichol and Dhariwal and for the Scos noise schedule.
  • Figure 5: Left: The linear noise schedule (top row) and the $\mathrm{Scos^2}$ noise schedule (bottom row) applied to a driving scenario sample. Right: $\bar{\alpha}_t$ in the diffusion process for the linear noise schedule and the Scos noise schedule with different powers.
  • ...and 3 more figures
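
The captions for Figures 2, 4 and 5 compare noise schedules through $\beta_t$ and $\bar{\alpha}_t$. As a rough illustration of how these quantities relate, the sketch below computes $\bar{\alpha}_t$ for a linear schedule and for the cosine schedule of Nichol and Dhariwal, then recovers $\beta_t$ from $\bar{\alpha}_t$. The Scos schedule is not defined in this excerpt, so it is deliberately left out. The values `T = 1000`, `beta_start = 1e-4`, `beta_end = 0.02`, `s = 0.008`, and `max_beta = 0.999` are commonly used defaults assumed here, not values taken from the paper, and the function names are hypothetical.

```python
import math


def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for linearly spaced betas."""
    alpha_bar, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped at max_beta."""
    betas, prev = [], 1.0
    for a in alpha_bar:
        betas.append(min(1.0 - a / prev, max_beta))
        prev = a
    return betas


T = 1000  # assumed number of diffusion steps
linear = linear_alpha_bar(T)
cosine = cosine_alpha_bar(T)
cosine_betas = betas_from_alpha_bar(cosine)

# Signal retained at step 200: a larger alpha_bar means less noise so far.
print(f"linear alpha_bar[200] = {linear[199]:.3f}")
print(f"cosine alpha_bar[200] = {cosine[199]:.3f}")
```

Under these assumed defaults, the cosine schedule retains more signal than the linear schedule at step 200, which is the kind of difference the figures examine when asking how quickly small objects fade.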