DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes

Yiyuan Liang, Zhiying Yan, Liqun Chen, Jiahuan Zhou, Luxin Yan, Sheng Zhong, Xu Zou

TL;DR

This work tackles object editing in driving scenes for data augmentation and model evaluation in autonomous driving. It introduces DriveEditor, a diffusion-based framework that unifies repositioning, insertion, replacement, and deletion through a shared input set, leveraging a Depth-aware Position Controller and a multi-level Appearance Maintenance module. Appearance is preserved via low-level cut-and-paste, high-level CLIP-guided semantics, and 3D priors from SV3D integrated through a 3D Information Fusion Module, while 3D position is controlled by depth-aware projections of 3D bounding boxes into the image plane. Evaluations on nuScenes show high fidelity and temporal coherence, enabling long-video iterative editing and providing effective data augmentation for downstream 3D object detection, with some generalization to unseen datasets like Waymo.

Abstract

Vision-centric autonomous driving systems require diverse data for robust training and evaluation, which can be augmented by manipulating object positions and appearances within existing scene captures. While recent advancements in diffusion models have shown promise in video editing, their application to object manipulation in driving scenarios remains challenging due to imprecise positional control and difficulties in preserving high-fidelity object appearances. To address these challenges in position and appearance control, we introduce DriveEditor, a diffusion-based framework for object editing in driving videos. DriveEditor offers a unified framework for comprehensive object editing operations, including repositioning, replacement, deletion, and insertion. These diverse manipulations are all achieved through a shared set of varying inputs, processed by identical position control and appearance maintenance modules. The position control module projects the given 3D bounding box while preserving depth information and hierarchically injects it into the diffusion process, enabling precise control over object position and orientation. The appearance maintenance module preserves consistent attributes with a single reference image by employing a three-tiered approach: low-level detail preservation, high-level semantic maintenance, and the integration of 3D priors from a novel view synthesis model. Extensive qualitative and quantitative evaluations on the nuScenes dataset demonstrate DriveEditor's exceptional fidelity and controllability in generating diverse driving scene edits, as well as its remarkable ability to facilitate downstream tasks. Project page: https://yvanliang.github.io/DriveEditor.
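The abstract says the position control module projects the given 3D bounding box while preserving depth information, but it does not spell out the computation. As a rough illustration only, the sketch below uses a standard pinhole camera model and keeps each corner's depth next to its pixel coordinates. The function names, the box parameterization (center, size, yaw), and the camera convention (x right, y down, z forward) are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
PixelWithDepth = Tuple[float, float, float]  # (u, v, depth)


def box_corners(center: Point3D, size: Point3D, yaw: float) -> List[Point3D]:
    """Return the 8 corners of a 3D box in camera coordinates.

    Assumed convention (hypothetical): x right, y down, z forward;
    size = (length, height, width); yaw rotates about the y axis.
    """
    cx, cy, cz = center
    length, height, width = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-height / 2, height / 2):
            for dz in (-width / 2, width / 2):
                # Rotate the (dx, dz) offset about the vertical axis.
                x = cx + cos_y * dx + sin_y * dz
                z = cz - sin_y * dx + cos_y * dz
                corners.append((x, cy + dy, z))
    return corners


def project_with_depth(points: List[Point3D], fx: float, fy: float,
                       u0: float, v0: float) -> List[PixelWithDepth]:
    """Pinhole projection that keeps depth as a third value per point.

    Points at or behind the camera plane (z <= 0) are dropped.
    """
    projected = []
    for x, y, z in points:
        if z <= 0:
            continue
        projected.append((fx * x / z + u0, fy * y / z + v0, z))
    return projected


# Usage with made-up intrinsics and a made-up box 10 m in front of the camera.
corners_3d = box_corners(center=(0.0, 0.0, 10.0), size=(4.0, 2.0, 2.0), yaw=0.0)
corners_2d = project_with_depth(corners_3d, fx=1000.0, fy=1000.0, u0=800.0, v0=450.0)
for u, v, depth in corners_2d:
    print(f"u={u:.1f} v={v:.1f} depth={depth:.1f}")
```

Keeping the depth value means the front and back faces of the box stay distinguishable after projection, which a plain 2D outline would lose. How DriveEditor actually encodes and injects this signal is described in the paper itself, not here.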

Paper Structure

This paper contains 26 sections, 6 equations, 17 figures, and 4 tables.

Figures (17)

  • Figure 1: Visualizations of the editing capability of DriveEditor and baselines. (a) DriveEditor enables user-friendly repositioning, insertion, replacement, and deletion within a unified framework. It precisely controls an object's position and orientation based on the 3D bounding box (top left; required for repositioning and insertion tasks that alter object position), and maintains high-fidelity appearance attributes of the object from a single reference image (bottom left; required for insertion and replacement tasks that alter object appearance). (b) Deletion and replacement results compared with baselines. ProPainter's deletion results suffer from artifacts. Text2Video-Zero uses the text prompt "replace the champagne-colored car with dark gray van" to guide the replacement process, yet it produces unrealistic visual results and alters the appearance of other vehicles.
  • Figure 2: (a) High-level overview of DriveEditor. (b) Diagram of the training pipeline of DriveEditor. Three levels of appearance control are established based on the single reference image $\textbf{I}^r$: low-level detail preservation through a cut-and-paste approach, high-level semantic maintenance through cross-attention (omitted in the pipeline for brevity), and incorporation of 3D priors derived from the frozen SV3D U-Net. For position control, we perform a projection that preserves depth information, followed by the Pose Controller to extract multi-scale features. Control signals are injected through three distinct paths in each block of the video model: position features into ResBlocks, semantic features via cross-attention, and 3D features added to block outputs.
  • Figure 3: DriveEditor is trained to reconstruct occluded objects using inputs from our dataset. At inference time, it performs various editing tasks based on specific input prompts.
  • Figure 4: Top row: Original videos. Middle left: Qualitative comparison on the deletion task. ProPainter suffers from artifacts, while SD lacks temporal consistency. DriveEditor effectively generates plausible occluded regions. Middle right: Qualitative comparison on the replacement task. T2V loses realism; for instance, the roof color remains unchanged. TAV alters the overall style of the video and leads to object deformations. In contrast, DriveEditor maintains high-fidelity object details from the reference image. Bottom left: Visualization of object insertion using DriveEditor. It enables precise control over the insertion position while maintaining appearance from the reference image. Bottom right: Visualization of object repositioning using DriveEditor. The object is accurately repositioned to align with the GT bounding box while preserving its original appearance.
  • Figure 5: Effectiveness of our proposed modules in controlling the position and orientation of objects. GT bounding boxes are outlined in black within the images.
  • ...and 12 more figures