MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator

Xuehai He, Shijie Zhou, Thivyanth Venkateswaran, Kaizhi Zheng, Ziyu Wan, Achuta Kadambi, Xin Eric Wang

TL;DR

MorphoSim addresses the need for programmable, multi-view 4D world models in robotics by integrating a language-driven interface with trajectory-guided 4D generation and editable 4D representations. The approach couples an LLM-based command parameterizer, a Scene Generator that uses trajectory-conditioned diffusion and dynamic 3D Gaussians, and a Scene Editor for object-level edits such as color changes, extraction, and removal, all while preserving temporal and multi-view coherence. Key contributions include the three-module architecture, trajectory-guided cross-attention mechanisms, a dynamic control submodule, and a static edit pathway with feature-field distillation, enabling both data generation and robust evaluation of visuomotor policies. The framework demonstrates high-fidelity 4D scenes and flexible edits on robotics-relevant scenarios, facilitating synthetic data creation, controlled perturbations for evaluation, and rapid task-variant construction.

Abstract

World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.

Paper Structure

This paper contains 13 sections, 9 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: MorphoSim is a fully natural language-guided 4D scene generation engine that enables generation and editing of 4D scenes based on language commands. Given a natural language input, MorphoSim constructs a 4D scene and provides a unified framework for multiple tasks, including high-quality scene generation, interactive modification of object motion and appearance, and object extraction or removal.
  • Figure 2: Overview of the MorphoSim pipeline. It consists of a command parameterizer for natural language comprehension, a controllable scene generation module that generates 4D scenes following motion guidance for dynamic objects, and an interactive scene editing module for executing edits.
  • Figure 3: Qualitative examples of 4D scene editing in MorphoSim for object motion control during the generation stage. MorphoSim allows different object motion directions to be specified in natural language and then changes the scene so that objects move according to the given instructions.
  • Figure 4: Qualitative examples of 4D scene editing in MorphoSim during the reconstruction stage. (a) and (b) demonstrate color editing, (c) and (d) show object extraction, and (e) and (f) illustrate object removal. In each subfigure, the first row shows the global view generated from the text prompt, the second row shows the global view after scene editing, and the third row shows a novel view after editing. The language commands for each example are as follows: (a) "The fish swims through the crystal-clear waters from right to left" to generate the scene, followed by "Make the color of the fish and seaweed black." (b) "The bus is moving from right to left" to generate the scene, followed by "Make the bus yellow." (c) "A serene boat glides gracefully through tranquil waters from left to right" to generate the scene, followed by "Extract the boat." (d) "A car is moving from right to left through a serene sunlit landscape" to generate the scene, followed by "Extract the car." (e) "A small, vibrant red rubber ball is bouncing from right to left" to generate the scene, followed by "Delete the ball." (f) "A sleek black motorcycle is gliding effortlessly from right to left" to generate the scene, followed by "Delete the motorcycle."
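
As a rough illustration of the command flow described in Figure 2, using the example commands from Figure 4, the Python sketch below routes a language command to either the generation stage or the editing stage. It is not MorphoSim's implementation: the paper's command parameterizer is LLM-based, whereas this sketch substitutes a few regular expressions, and the names `ParsedCommand` and `parse_command`, the field layout, and the direction encoding are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedCommand:
    """Hypothetical structured form of a language command (not the paper's schema)."""
    module: str                      # "generator" or "editor"
    action: str                      # "generate", "recolor", "extract", or "remove"
    target: Optional[str] = None     # object(s) the edit applies to
    color: Optional[str] = None      # new color for a recolor edit
    direction: Optional[str] = None  # motion direction for generation


def parse_command(text: str) -> ParsedCommand:
    """Regex stand-in for an LLM-based command parameterizer (assumption)."""
    low = text.strip().rstrip(".").lower()

    # Color edit, e.g. "Make the bus yellow" or "Make the color of the fish black".
    m = re.match(r"^make (?:the color of )?the (.+) (\w+)$", low)
    if m:
        return ParsedCommand("editor", "recolor", target=m.group(1), color=m.group(2))

    # Object extraction, e.g. "Extract the boat".
    m = re.match(r"^extract the (.+)$", low)
    if m:
        return ParsedCommand("editor", "extract", target=m.group(1))

    # Object removal, e.g. "Delete the ball".
    m = re.match(r"^(?:delete|remove) the (.+)$", low)
    if m:
        return ParsedCommand("editor", "remove", target=m.group(1))

    # Anything else is treated as a scene-generation prompt; pick up a
    # left/right motion direction if the prompt states one.
    m = re.search(r"from (left|right) to (left|right)", low)
    direction = f"{m.group(1)}_to_{m.group(2)}" if m else None
    return ParsedCommand("generator", "generate", direction=direction)


if __name__ == "__main__":
    for cmd in [
        "The bus is moving from right to left",
        "Make the bus yellow.",
        "Extract the boat.",
        "Delete the motorcycle.",
    ]:
        print(cmd, "->", parse_command(cmd))
```

In this sketch, generation prompts carry the motion direction that would guide the trajectory-conditioned generator, while edit commands carry only the target object and, for color edits, the new color, mirroring the split between the generation-stage control in Figure 3 and the reconstruction-stage edits in Figure 4.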