SIMSplat: Predictive Driving Scene Editing with Language-aligned 4D Gaussian Splatting

Sung-Yeon Park, Adam Lee, Juanwu Lu, Can Cui, Luyang Jiang, Rohit Gupta, Kyungtae Han, Ahmadreza Moradipari, Ziran Wang

TL;DR

SIMSplat tackles editable, realistic driving-scene generation by unifying language-grounded querying with a scene-graph-based 4D Gaussian splat representation. It introduces a four-stage pipeline: scene reconstruction with 4D Gaussians, language grounding of appearance and motion, an LLM-driven editor, and multi-agent path refinement that keeps interactions globally coherent. On the Waymo Open Dataset, it achieves state-of-the-art performance in road-object querying, the highest task-completion rate for editing prompts, and, owing to predictive refinement, the lowest collision and failure rates. This approach enables intuitive natural-language editing of dynamic traffic scenes, including pedestrians, and offers a scalable foundation for realistic scenario generation in autonomous driving research.

Abstract

Driving scene manipulation with sensor data is emerging as a promising alternative to traditional virtual driving simulators. However, existing frameworks struggle to generate realistic scenarios efficiently due to limited editing capabilities. To address these challenges, we present SIMSplat, a predictive driving scene editor with language-aligned Gaussian splatting. As a language-controlled editor, SIMSplat enables intuitive manipulation using natural language prompts. By aligning language with Gaussian-reconstructed scenes, it further supports direct querying of road objects, allowing precise and flexible editing. Our method provides detailed object-level editing, including adding new objects and modifying the trajectories of both vehicles and pedestrians, while also incorporating predictive path refinement through multi-agent motion prediction to generate realistic interactions among all agents in the scene. Experiments on the Waymo dataset demonstrate SIMSplat's extensive editing capabilities and adaptability across a wide range of scenarios. Project page: https://sungyeonparkk.github.io/simsplat/
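To make the idea of querying road objects by language concrete, here is a minimal sketch that assumes each object in the scene carries one feature vector living in the same space as a text embedding of the prompt. The function names, the toy 3-dimensional features, the object IDs, and the 0.5 threshold are all illustrative assumptions; the abstract does not specify how the matching is scored, and cosine similarity is used here only as a common stand-in.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_objects(prompt_embedding, object_features, threshold=0.5):
    """Return (object_id, score) pairs scoring at least `threshold`, best first.

    `object_features` maps a hypothetical object ID to its feature vector.
    """
    scored = [
        (obj_id, cosine_similarity(prompt_embedding, feat))
        for obj_id, feat in object_features.items()
    ]
    matches = [(obj_id, score) for obj_id, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy data: hand-written vectors standing in for learned language-aligned features.
object_features = {
    "vehicle_3": [0.9, 0.1, 0.0],
    "pedestrian_7": [0.1, 0.9, 0.2],
    "vehicle_12": [0.7, 0.0, 0.7],
}
prompt_embedding = [1.0, 0.0, 0.0]  # stand-in for an embedded text prompt

matches = query_objects(prompt_embedding, object_features)
for obj_id, score in matches:
    print(f"{obj_id}: {score:.3f}")
```

With these toy vectors, the two vehicles pass the threshold and the pedestrian does not. In a real system the embeddings would come from a learned text encoder and from the features stored with the reconstructed scene.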

Paper Structure

This paper contains 12 sections, 10 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Our framework enables language-guided editing in driving scenarios. It begins by directly querying the target object from the Gaussian scene, followed by editing through the LLM agent, and verification of all agents' motions via multi-agent path refinement. This process ensures that the final edited scene remains realistic, even including pedestrians. The red and green areas indicate possible collision regions and refined regions, respectively.
  • Figure 2: Pipeline of language alignment. The appearance and temporal alignment modules extract appearance, motion, and location features, which are then embedded into scene-graph Gaussians. Given a natural language prompt, these features enable grounding of the corresponding objects in the road scene.
  • Figure 3: Editing process. Given a user prompt, the LLM agent coordinates multiple modules. After identifying the target object, retrieving assets, and planning an initial trajectory, the results are refined by the multi-agent path refinement module. Finally, diffusion-based inpainting is applied and the edited scene is rendered.
  • Figure 4: Qualitative comparison of object querying. Our method effectively captures the behaviors of road agents within the scene.
  • Figure 5: Qualitative editing results. SIMSplat supports various types of editing, including adding new objects, removal or replacement, and modification of pedestrians or vehicles. Gray areas indicate the edited regions, and images are zoomed in for clearer visualization.
  • ...and 1 more figure
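The captions for Figures 1 and 3 describe checking an edited trajectory against the other agents and refining it when a collision is possible. The sketch below illustrates only that check-then-refine loop. It is not the paper's method: the paper refines paths with multi-agent motion prediction, whereas this example swaps in a naive "delay the edited agent" heuristic. The function names, the safety radius, and the toy trajectories are all assumptions made for illustration.

```python
import math


def first_conflict(traj_a, traj_b, safety_radius=2.0):
    """Return the first timestep where two agents are closer than `safety_radius`.

    Trajectories are lists of (x, y) positions sampled at the same timesteps.
    Returns None if the agents never come that close.
    """
    for t, ((xa, ya), (xb, yb)) in enumerate(zip(traj_a, traj_b)):
        if math.hypot(xa - xb, ya - yb) < safety_radius:
            return t
    return None


def delay_trajectory(traj, steps):
    """Hold the first waypoint for `steps` timesteps, keeping the length fixed."""
    if steps <= 0:
        return list(traj)
    return ([traj[0]] * steps + list(traj))[: len(traj)]


def refine_by_delay(edited, others, safety_radius=2.0, max_delay=10):
    """Find the smallest start delay that clears every other agent.

    Returns (refined_trajectory, delay), or (None, None) if no delay works.
    """
    for delay in range(max_delay + 1):
        candidate = delay_trajectory(edited, delay)
        if all(first_conflict(candidate, o, safety_radius) is None for o in others):
            return candidate, delay
    return None, None


# Toy scene: an edited vehicle drives along +x while a pedestrian crosses at x = 8.
edited = [(2.0 * t, 0.0) for t in range(8)]
pedestrian = [(8.0, 6.0 - 2.0 * t) for t in range(8)]

conflict_step = first_conflict(edited, pedestrian, safety_radius=2.5)
refined, delay = refine_by_delay(edited, [pedestrian], safety_radius=2.5)
print("conflict at timestep:", conflict_step)  # 3
print("delay applied:", delay)                 # 1
```

In this toy scene the vehicle comes within 2.5 units of the pedestrian at timestep 3, and holding it at its start for one timestep clears the conflict. A prediction-based refiner would also let the other agents react to the edit, which this heuristic does not attempt.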