Any4D: Open-Prompt 4D Generation from Natural Language and Images

Hao Li, Qiao Sun

TL;DR

Any4D tackles open-prompt 4D scene generation from a single image or a natural-language prompt by tying generation and reconstruction together through a shared camera trajectory. It introduces a two-stage pipeline: (1) camera-controlled video generation with a diffusion model conditioned on Plücker-embedded camera trajectories, and (2) a persistent 3D Gaussian representation with hybrid SE(3) motion bases that reconstructs temporally coherent 4D geometry, trained with a loss that includes a motion-coefficient term. The approach reports state-of-the-art-level reconstruction quality on monocular datasets, outperforming baselines in PSNR, SSIM, and LPIPS while running on a single RTX 3090, and it enables applications such as 3D tracking and multi-view content synthesis. Limitations include difficulty with large viewpoint changes, long-range motions, and deformations, as well as reliance on pretrained video diffusion models, which may carry upstream biases.
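To make the Plücker-embedded trajectory conditioning concrete, here is a minimal sketch of how one pixel ray of one camera pose can be turned into a 6-D Plücker vector. It is not taken from the paper: the function name `plucker_ray`, the (moment, direction) ordering, the zero-skew pinhole intrinsics, and the world-to-camera extrinsic convention are all assumptions made for illustration.

```python
import math

def plucker_ray(u, v, K, R, t):
    """Return the 6-D Plücker embedding (o x d, d) of the ray through pixel (u, v).

    Assumptions (not from the source): K is a 3x3 pinhole intrinsic matrix with
    zero skew, and (R, t) is a world-to-camera extrinsic, x_cam = R x_world + t.
    """
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    # Back-project the pixel to a direction in camera coordinates.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate into world coordinates: d = R^T d_cam, then normalize.
    d = [sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    d = [c / norm for c in d]
    # Camera center in world coordinates: o = -R^T t.
    o = [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]
    # Moment vector m = o x d; it is perpendicular to d by construction.
    m = [
        o[1] * d[2] - o[2] * d[1],
        o[2] * d[0] - o[0] * d[2],
        o[0] * d[1] - o[1] * d[0],
    ]
    return m + d

# Hypothetical camera: 100 px focal length, principal point at (50, 50).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embedding = plucker_ray(50.0, 50.0, K, R, [-1.0, 0.0, 0.0])
```

Evaluating this for every pixel of every frame would give a per-frame, six-channel map that encodes the trajectory densely, which is the general form such conditioning signals usually take. How Any4D injects the map into the diffusion model is not described in this summary.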

Abstract

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restrict video generation to fixed, shorter horizons. This restriction (1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, (2) reduces learning complexity, (3) improves data efficiency in embodied data collection, and (4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

Paper Structure

This paper contains 21 sections, 19 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: System Overview. Given a text prompt or a single image, our framework first encodes it into a latent spatial representation using a VAE encoder. Concurrently, the specified trajectory is encoded using Plücker coordinates computed from the camera intrinsic $\mathbf{K}$ and extrinsic $\mathbf{E}$ parameters. A video sequence with the desired camera motion is then generated by CogVideoX [yang2025cogvideoxtexttovideodiffusionmodels] (see the camera-control subsection). Using off-the-shelf models [doersch2023tapirtrackingpointperframe, yang2024depthanythingunleashingpower], we extract depth maps and 2D point trajectories from the generated video. These, along with the RGB frames, serve as input for the reconstruction stage. We design a persistent 3D Gaussian representation for dynamic scenes, where motion is modeled via a compact set of globally shared hybrid $\mathbb{SE}(3)$ motion bases. The motion of each Gaussian is expressed as a linear combination of these bases, enabling efficient modeling of complex dynamics (see the 4D reconstruction subsection).
  • Figure 2: 4D scene generation. We present a novel 4D dynamic scene generation framework that synthesizes high-quality, semantically rich, and spatiotemporally consistent dynamic scenes from a single image or natural language instruction, conditioned on target camera trajectories.
  • Figure 3: Modeling results from novel viewpoints not directly observed by the camera trajectory. Views include left, top, right, and bottom perspectives relative to the reference image in the 4D scene (arrows indicate viewing directions).
  • Figure 4: Visual comparison of reconstruction quality on the iPhone dataset.
  • Figure 5: 3D tracking visualization on three datasets. Trajectories are visualized in the 3D world coordinate system, reflecting object motion within the scene.
  • ...and 7 more figures
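
The Figure 1 caption states that each Gaussian's motion is a linear combination of globally shared motion bases. The sketch below illustrates that idea in its simplest form, assuming each basis is a rigid transform (R, t) and that the combination is a weighted sum of the transformed positions (linear blending). The function name `blend_motion_bases`, the data layout, and the blending rule are illustrative assumptions; the paper's actual "hybrid" parameterization is not specified in this summary.

```python
def blend_motion_bases(x, weights, bases):
    """Move a Gaussian center x by a weighted sum of rigid transforms.

    bases:   list of (R, t) pairs shared by all Gaussians at one time step,
             with R a 3x3 rotation (nested lists) and t a 3-vector.
    weights: this Gaussian's motion coefficients, one per basis
             (assumed here to sum to 1).
    """
    out = [0.0, 0.0, 0.0]
    for w, (R, t) in zip(weights, bases):
        for i in range(3):
            # Apply basis b to x, then accumulate its weighted contribution.
            out[i] += w * (sum(R[i][k] * x[k] for k in range(3)) + t[i])
    return out

# Two hypothetical bases: "stay put" and "shift 2 units along x".
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bases = [(identity, [0.0, 0.0, 0.0]), (identity, [2.0, 0.0, 0.0])]
moved = blend_motion_bases([1.0, 1.0, 1.0], [0.5, 0.5], bases)
```

Because the bases are shared across the scene, only the small per-Gaussian weight vector has to be stored for each primitive, which is one plausible reading of why the caption calls the representation compact.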