Any4D: Open-Prompt 4D Generation from Natural Language and Images
Hao Li, Qiao Sun
TL;DR
Any4D tackles open-prompt 4D scene generation from a single image or a natural-language prompt by tying generation and reconstruction together through a shared camera trajectory. It introduces a two-stage pipeline: (1) camera-controlled video generation with a diffusion model conditioned on Plücker-embedded camera trajectories, and (2) a persistent 3D Gaussian representation with hybrid SE(3) motion bases that reconstructs temporally coherent 4D geometry, trained with a loss that includes a motion-coefficient term. The approach reports reconstruction quality on monocular datasets that outperforms baselines in PSNR, SSIM, and LPIPS while running efficiently on a single RTX 3090, and it enables practical applications such as 3D tracking and multi-view content synthesis. Limitations include difficulty with large viewpoint changes, long-range motions, and deformations, as well as reliance on pretrained video diffusion models, which may carry upstream biases.
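To make the Plücker-embedded trajectory conditioning more concrete, the following is a minimal sketch of how a single pixel ray can be turned into a 6-D Plücker vector. It is not taken from Any4D: the function name `plucker_ray`, the pinhole intrinsics `fx, fy, cx, cy`, the camera-to-world pose `R_c2w, t_c2w`, and the (moment, direction) ordering are all assumptions chosen for illustration.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def matvec(m, v):
    """Multiply a 3x3 matrix (nested sequences) by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

def plucker_ray(u, v, fx, fy, cx, cy, R_c2w, t_c2w):
    """Return the 6-D Plücker embedding (o x d, d) of the ray through pixel (u, v).

    Assumes a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy)
    and a camera-to-world pose given by rotation R_c2w and translation t_c2w.
    """
    # Back-project the pixel to a direction in camera coordinates.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into world coordinates and normalize to a unit direction.
    d = normalize(matvec(R_c2w, d_cam))
    # The camera center in world coordinates is the ray origin.
    o = tuple(float(x) for x in t_c2w)
    # The moment o x d is the same for any point chosen on the ray.
    moment = cross(o, d)
    return moment + d

# Example: identity rotation, camera center at (1, 0, 0), ray through the principal point.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
embedding = plucker_ray(32.0, 32.0, 50.0, 50.0, 32.0, 32.0, identity, (1.0, 0.0, 0.0))
```

Evaluating this function at every pixel of every frame would yield a per-frame 6-channel map that a video diffusion model could take as conditioning. How Any4D actually injects such maps into its model is not specified in this summary.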
Abstract
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restrict video generation to fixed, shorter horizons. This approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
