DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes
Yiyuan Liang, Zhiying Yan, Liqun Chen, Jiahuan Zhou, Luxin Yan, Sheng Zhong, Xu Zou
TL;DR
This work tackles object editing in driving scenes for data augmentation and model evaluation in autonomous driving. It introduces DriveEditor, a diffusion-based framework that unifies repositioning, insertion, replacement, and deletion through a shared input set, leveraging a Depth-aware Position Controller and a multi-level Appearance Maintenance module. Appearance is preserved via low-level cut-and-paste, high-level CLIP-guided semantics, and 3D priors from SV3D integrated through a 3D Information Fusion Module, while 3D position is controlled by depth-aware projections of 3D bounding boxes into the image plane. Evaluations on nuScenes show high fidelity and temporal coherence, enabling long-video iterative editing and providing effective data augmentation for downstream 3D object detection, with some generalization to unseen datasets like Waymo.
Abstract
Vision-centric autonomous driving systems require diverse data for robust training and evaluation, and such data can be augmented by manipulating object positions and appearances within existing scene captures. While recent advances in diffusion models have shown promise in video editing, applying them to object manipulation in driving scenarios remains challenging due to imprecise positional control and the difficulty of preserving high-fidelity object appearance. To address these challenges in position and appearance control, we introduce DriveEditor, a diffusion-based framework for object editing in driving videos. DriveEditor offers a unified framework for comprehensive object editing operations, including repositioning, replacement, deletion, and insertion. These diverse manipulations are all achieved through a shared set of varying inputs, processed by identical position control and appearance maintenance modules. The position control module projects the given 3D bounding box into the image while preserving depth information and hierarchically injects the result into the diffusion process, enabling precise control over object position and orientation. The appearance maintenance module keeps the edited object's attributes consistent with a single reference image through a three-tiered approach: low-level detail preservation, high-level semantic maintenance, and the integration of 3D priors from a novel view synthesis model. Extensive qualitative and quantitative evaluations on the nuScenes dataset demonstrate DriveEditor's fidelity and controllability in generating diverse driving scene edits, as well as its ability to support downstream tasks such as 3D object detection. Project page: https://yvanliang.github.io/DriveEditor.
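
To make the geometric step behind the position control module concrete, the following is a minimal sketch of projecting a 3D bounding box into the image plane while keeping each corner's depth. It is not the authors' implementation: the function names `box_corners` and `project_with_depth`, the camera convention (x right, y down, z forward), the rotation about the vertical axis only, and the intrinsics values are all assumptions chosen for illustration. How the projected box is encoded and injected into the diffusion process is not covered here.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def box_corners(center: Point3, size: Point3, yaw: float) -> List[Point3]:
    """Eight corners of a 3D box in camera coordinates (assumed convention:
    x right, y down, z forward).

    size is (extent along x, extent along y, extent along z) before rotation;
    yaw rotates the box about the camera's vertical (y) axis.
    """
    cx, cy, cz = center
    w, h, l = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners: List[Point3] = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                x, y, z = sx * w, sy * h, sz * l
                # Rotate about the y axis, then translate to the box center.
                xr = cos_y * x + sin_y * z
                zr = -sin_y * x + cos_y * z
                corners.append((cx + xr, cy + y, cz + zr))
    return corners


def project_with_depth(
    points: List[Point3], fx: float, fy: float, u0: float, v0: float
) -> List[Tuple[float, float, float]]:
    """Pinhole projection that keeps depth: one (u, v, z) per point in front
    of the camera. Points with z <= 0 are dropped."""
    projected = []
    for x, y, z in points:
        if z <= 0:
            continue
        projected.append((fx * x / z + u0, fy * y / z + v0, z))
    return projected


# Hypothetical example: a car-sized box 10 m ahead of the camera, with
# made-up intrinsics for a 1600x900 image.
corners = box_corners(center=(0.0, 0.0, 10.0), size=(2.0, 1.5, 4.0), yaw=0.0)
pixels_with_depth = project_with_depth(corners, 1000.0, 1000.0, 800.0, 450.0)
```

Keeping `z` alongside each pixel coordinate is what distinguishes this from a plain 2D box: two boxes that overlap in the image can still be ordered by distance, which a flat 2D mask cannot express.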
