DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai, Ziyao Lin, Cheng Lu

TL;DR

DrivePTS uses a progressive learning strategy, reinforced by an explicit mutual information constraint, to mitigate inter-dependency between geometric conditions. It also introduces a frequency-guided structure loss that strengthens the model's sensitivity to high-frequency elements and improves foreground structural fidelity.

Abstract

Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.
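
The abstract contrasts a denoising loss with uniform spatial weighting against a frequency-guided structure loss, but it does not give the formula. The following is a minimal illustrative sketch of the general idea only, not the paper's definition: it builds a per-pixel weight map from the high-pass-filtered image and uses it to reweight a squared-error denoising term. The function names, the `cutoff` parameter, and the `1 + normalized magnitude` weighting are all assumptions made for this example.

```python
import numpy as np


def high_frequency_weight(image, cutoff=0.25):
    """Return a per-pixel weight map in [1, 2] for a 2D grayscale image.

    Hypothetical helper: pixels with strong high-frequency content
    (edges, fine structure) get weights closer to 2, flat regions stay near 1.
    `cutoff` is a radial frequency threshold in cycles per pixel (assumed value).
    """
    h, w = image.shape
    spectrum = np.fft.fft2(image)

    # Radial frequency of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Keep only the bins above the cutoff and go back to pixel space.
    high = np.fft.ifft2(spectrum * (radius > cutoff)).real
    magnitude = np.abs(high)

    peak = magnitude.max()
    if peak < 1e-8:
        # No meaningful high-frequency content: fall back to uniform weights.
        return np.ones_like(image, dtype=float)
    return 1.0 + magnitude / peak


def weighted_denoising_loss(noise_pred, noise_true, weight):
    """Squared-error denoising loss with spatially varying weights."""
    return float(np.mean(weight * (noise_pred - noise_true) ** 2))
```

With a uniform weight map of ones this reduces to the standard mean squared error, so the weight map is the only place where structure enters. On an image with a sharp vertical edge, the same prediction error costs more at the edge than in a flat region, which is the qualitative behaviour the abstract attributes to a structure-aware loss.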

Paper Structure (35 sections, 8 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Comparison of various scene generation methods on the modified map layouts. The first column highlights the modified regions of each map. While MagicDrive fails to adapt to map modifications, our DrivePTS successfully generates scenes aligning with the updated map configurations.
  • Figure 2: The overall architecture of our proposed DrivePTS. The left part illustrates our adopted generative network structure, while the center depicts the training process corresponding to the proposed progressive learning strategy. The right part highlights the implementation of frequency-guided structure loss.
  • Figure 3: Impact of different iterative steps on the FID and geometry controllability of generated images.
  • Figure 4: Examples of controllable road layout generation via HD map editing. Each row shows a case where the map is modified to introduce new road structures, with the generated image accurately reflecting these changes.
  • Figure 5: Visualization of the impact of textual descriptions on scene reconstruction. Rows with the same index denote the same scene. Improvements brought by multi-view hierarchical descriptions are highlighted with various colored circles or rectangles.
  • ...and 7 more figures