4DNeX: Feed-Forward 4D Generative Modeling Made Easy

Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu

TL;DR

4DNeX tackles the problem of generating dynamic 4D scenes from a single image by fine-tuning a pretrained video diffusion model in a feed-forward manner. It introduces a unified 6D RGB+XYZ video representation and simple adaptation strategies, supported by the 4DNeX-10M dataset with high-quality pseudo-annotations, to enable efficient image-to-4D generation and novel-view video synthesis. The approach achieves high-quality dynamic point clouds and competitive novel-view results with substantially improved efficiency, demonstrating strong generalization to in-the-wild scenes. This work lays the groundwork for scalable, data-efficient 4D world models capable of simulating dynamic scene evolution from a single image.

Abstract

We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically: 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches; 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry; and 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.
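
To make the idea of a 6D RGB+XYZ video concrete, the following is a minimal NumPy sketch of the data layout only, not the authors' implementation. It assumes each pixel of each frame carries a color and a 3D coordinate, and that one frame can be read back as a colored point cloud. The function names `to_6d_video` and `frame_to_point_cloud` and the channel-stacked layout are hypothetical choices made for illustration; per the Figure 5 caption, the model itself fuses the two modalities by width-wise concatenation in latent space.

```python
import numpy as np


def to_6d_video(rgb, xyz):
    """Pair an RGB video with an XYZ point-map video (illustrative layout).

    rgb: (T, H, W, 3) colors; xyz: (T, H, W, 3) per-pixel 3D coordinates.
    Returns a (T, H, W, 6) array. Channel stacking is an assumption made
    here for clarity, not the fusion used inside the model.
    """
    if rgb.shape != xyz.shape or rgb.shape[-1] != 3:
        raise ValueError("rgb and xyz must both have shape (T, H, W, 3)")
    return np.concatenate([rgb, xyz], axis=-1)


def frame_to_point_cloud(video6d, t):
    """Read frame t back as a colored point cloud.

    Returns (points, colors), each of shape (H * W, 3).
    """
    frame = video6d[t]
    colors = frame[..., :3].reshape(-1, 3)
    points = frame[..., 3:].reshape(-1, 3)
    return points, colors
```

Calling `frame_to_point_cloud` for every `t` yields one point cloud per time step, which is one way to picture the dynamic point clouds mentioned in the abstract.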

Paper Structure

This paper contains 22 sections, 13 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Visualization of 4DNeX-10M Dataset. Our dataset spans a wide range of dynamic scenarios, including indoor, outdoor, close-range, far-range, static, high-speed, and human-centric scenes. The word cloud summarizes common visual concepts captured in the dataset, while the 4D point clouds and camera trajectories demonstrate the spatial precision of our pseudo-annotations.
  • Figure 2: Data Curation Pipeline. Videos are collected from various sources and filtered during Data Cleaning. The retained videos are captioned with the LLaVA-Next-Video model during Video Captioning. During 3D/4D Annotation, the videos are processed and only those with high-quality annotations are kept. Dataset statistics are shown at the bottom right.
  • Figure 3: Comparison of fusion strategies for joint RGB and XYZ modeling. We explore five fusion strategies and analyze their impact on model compatibility and cross-modal alignment.
  • Figure 4: Comparison of spatial fusion strategies. We compare frame-, height-, and width-wise fusion in terms of the interaction distance between RGB and XYZ tokens.
  • Figure 5: Overview of 4DNeX. Given a single RGB image and an initialized XYZ map, 4DNeX encodes both inputs with a VAE encoder and fuses them via width-wise concatenation. The fused latent, combined with a noise latent and a guided mask, is processed by a LoRA-tuned Wan-DiT model to jointly generate RGB and XYZ videos. A lightweight post-optimization step recovers camera parameters and depth maps from the predicted outputs.
  • ...and 6 more figures
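
The frame-, height-, and width-wise fusion strategies named in the Figure 4 and Figure 5 captions can be pictured as concatenating the RGB and XYZ latents along different axes. The sketch below is a hedged illustration under stated assumptions: latents are arrays of shape (T, H, W, C), tokens are flattened in row-major (T, H, W) order, and "interaction distance" is taken to mean the gap in flattened index between an RGB token and its matching XYZ token. The function names `fuse` and `token_distance`, the flattening order, and that definition of distance are all assumptions, not details confirmed by the captions.

```python
import numpy as np

# Axis along which RGB and XYZ latents of shape (T, H, W, C) are joined.
AXES = {"frame": 0, "height": 1, "width": 2}


def fuse(rgb_latent, xyz_latent, strategy):
    """Concatenate two (T, H, W, C) latents along the chosen axis."""
    return np.concatenate([rgb_latent, xyz_latent], axis=AXES[strategy])


def token_distance(shape, strategy):
    """Flattened-index gap between an RGB token and its XYZ counterpart.

    shape: (T, H, W) of one modality's token grid.
    Assumes row-major flattening of the fused (T, H, W) grid.
    """
    axis = AXES[strategy]
    fused_shape = list(shape)
    fused_shape[axis] *= 2
    rgb_index = (0, 0, 0)
    xyz_index = [0, 0, 0]
    xyz_index[axis] = shape[axis]  # same position, shifted into the XYZ half
    flat_rgb = np.ravel_multi_index(rgb_index, tuple(fused_shape))
    flat_xyz = np.ravel_multi_index(tuple(xyz_index), tuple(fused_shape))
    return int(flat_xyz - flat_rgb)
```

Under these assumptions, a (T, H, W) grid gives gaps of T·H·W, H·W, and W for frame-, height-, and width-wise fusion respectively, so width-wise concatenation places matching RGB and XYZ tokens closest together in the flattened sequence.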