
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes

Mauricio Soroco, Francesco Pittaluga, Zaid Tasneem, Abhishek Aich, Bingbing Zhuang, Wuyang Chen, Manmohan Chandraker, Ziyu Jiang

Abstract

Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction-guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: language-guided masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes: we collect 255K images across 13 editing categories, outperform prior methods on L1, CLIP, and DINO metrics, achieve +46.4% user preference, and improve BEV segmentation IoU by +33%. Project page: https://msoroco.github.io/horizonweaver/


Paper Structure

This paper contains 46 sections, 6 equations, 16 figures, 17 tables.

Figures (16)

  • Figure 1: In each example, from left to right are the input image, LangMasks, and output image. Masks for global edits are blank. The careful design of HorizonWeaver across data (Sec. \ref{sec:automatic-paired-dataset-generation}), model (Sec. \ref{sec:LangMasks for fine-grained editing}), and training (Sec. \ref{sec:methods-training-objectives}) addresses three critical challenges in driving scene editing: 1) multi-level granularity (rows 2, 3); 2) rich high-level semantics (rows 1, 2); 3) ubiquitous domain shifts (Tab. \ref{tab:combined_OOD_editing_performance}).
  • Figure 2: Dataset Construction. Real-world data (Sec. \ref{subsec:real-world-paird-data}) are paired by camera pose, annotated using an image descriptor pipeline, and passed to an LLM to produce instructions. Pseudo-data (Sec. \ref{sec:methods-pseudo-dataset-development}) for local edits crops an annotated object before VLM filtering; global edits apply VLM filtering to full images. Our dataset is composed of image pairs, global editing instructions, and masks indicating fine-grained edits to perform.
  • Figure 3: LangMask Generation and Training. Left: To provide fine-grained instructions with rich semantics, we insert CLIP text features into binary masks (Sec. \ref{sec:LangMasks for fine-grained editing}). Right: To support LangMasks, we copy and expand the VAE, which is trained end-to-end with the editing model.
  • Figure 4: Training language-guided driving scene image editing. Our training pipeline supports both supervised training on paired images and unsupervised training on unpaired ones (e.g., downstream unseen real scenarios). We include three training objectives: supervised fine-tuning $\mathcal{L}_\text{sft}$ (Sec. \ref{sec:sft}), cycle consistency $\mathcal{L}_\text{cycle}$ (Sec. \ref{sec:cycle_consistency}), and CLIP-based instruction alignment $\mathcal{L}_\text{clip}$ (Sec. \ref{sec:clip_loss}).
  • Figure 5: HorizonWeaver Editing. Rows 1, 2, local edits: the masks (projected as binary images and stated in text for reference) enable modifications to traffic. Rows 3, 4, global edits: the text prompt informs the appearance of the scene. For brevity, only the portions relevant to the shown edits are displayed. Rows 5, 6, compound edits: the masks (projected as binary images) enable modifications to traffic while the text prompt informs the desired global appearance. We compare to Qwen [wu2025qwenimagetechnicalreport], OmniGen2 [wu2025omnigen2], UltraEdit [zhao2024ultraeditinstructionbasedfinegrainedimage], and BAGEL [deng2025emergingpropertiesunifiedmultimodal].
  • ...and 11 more figures
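Figure 3 describes LangMasks as binary masks enriched with CLIP text features. The paper does not release this code, so the following is only a minimal sketch of one plausible construction, assuming a D-dimensional text embedding is broadcast into the masked region of an H×W mask; the function name, tensor layout, and stand-in embedding are illustrative, not the authors' implementation:

```python
import numpy as np

def make_langmask(binary_mask: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Embed a text feature vector into the active region of a binary mask.

    binary_mask:    (H, W) array of {0, 1} marking the object to edit.
    text_embedding: (D,) feature vector, e.g. a CLIP text encoding of the
                    edit instruction (here a stand-in vector).
    Returns a (D, H, W) "LangMask": masked pixels carry the text feature,
    background pixels stay zero.
    """
    h, w = binary_mask.shape
    d = text_embedding.shape[0]
    langmask = np.zeros((d, h, w), dtype=np.float32)
    # Boolean indexing selects the masked pixels across all D channels;
    # the (D, 1) embedding broadcasts over however many pixels are set.
    langmask[:, binary_mask.astype(bool)] = text_embedding[:, None]
    return langmask

# Illustrative usage: a 4x4 mask with a 2x2 active patch and an 8-dim embedding.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
emb = np.full(8, 0.5, dtype=np.float32)
lm = make_langmask(mask, emb)  # shape (8, 4, 4)
```

A blank mask (all zeros) yields an all-zero LangMask, matching the Figure 1 caption's note that masks for global edits are blank, so such edits are driven entirely by the text prompt.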