One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control

Zhenxing Mi, Yuxin Wang, Dan Xu

TL;DR

One4D tackles joint RGB and geometry modeling for dynamic 4D content by unifying generation and reconstruction within a single video diffusion model. It introduces Decoupled LoRA Control (DLC), which preserves the base model's RGB priors while learning geometry through modality-specific adapters, and Unified Masked Conditioning (UMC), which supports single-image, sparse-frame, and full-video conditioning without architecture changes. Trained on a mixture of synthetic and real 4D data, the system performs strongly on both 4D generation and reconstruction: it outperforms prior generation baselines and is competitive with reconstruction-focused methods on geometry accuracy. The work advances geometry-aware world modeling with diffusion models and offers a practical route to high-quality dynamic 4D scenes under a modest computational budget.

Abstract

We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D
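
The three conditioning modes named in the abstract can be illustrated with a toy sketch of how observed frames might be packed into one fixed-length conditioning video plus a mask. This is not the authors' implementation. The function name `build_masked_conditioning`, the zero-filling of unobserved frames, the per-frame binary mask, and the choice of which frame indices are observed are all assumptions made for illustration. Frames are plain nested lists here rather than VAE latents.

```python
from typing import Dict, List, Tuple

Frame = List[List[float]]  # toy H x W single-channel frame (assumption)


def build_masked_conditioning(
    num_frames: int,
    height: int,
    width: int,
    observed: Dict[int, Frame],
) -> Tuple[List[Frame], List[int]]:
    """Pack observed frames into a fixed-length video plus a per-frame mask.

    Assumption: unobserved frames are zero-filled and flagged with mask 0,
    observed frames are copied as-is and flagged with mask 1.
    """
    video: List[Frame] = []
    mask: List[int] = []
    for t in range(num_frames):
        if t in observed:
            frame = observed[t]
            if len(frame) != height or any(len(row) != width for row in frame):
                raise ValueError(f"frame {t} does not match {height}x{width}")
            video.append([row[:] for row in frame])
            mask.append(1)
        else:
            video.append([[0.0] * width for _ in range(height)])
            mask.append(0)
    return video, mask


def constant_frame(value: float, height: int, width: int) -> Frame:
    """Hypothetical helper: a frame filled with a single value."""
    return [[value] * width for _ in range(height)]


T, H, W = 4, 2, 3
clip = [constant_frame(float(t + 1), H, W) for t in range(T)]

# Single image -> 4D generation: only the first frame is observed.
single_video, single_mask = build_masked_conditioning(T, H, W, {0: clip[0]})

# Sparse frames -> mixed generation and reconstruction.
sparse_video, sparse_mask = build_masked_conditioning(
    T, H, W, {0: clip[0], 3: clip[3]}
)

# Full video -> 4D reconstruction: every frame is observed.
full_video, full_mask = build_masked_conditioning(T, H, W, dict(enumerate(clip)))
```

In this sketch all three modes share one input shape, so the same network could consume any of them. Only the mask and the number of non-empty frames change.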

Paper Structure

This paper contains 13 sections, 8 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: One4D supports single-image-to-4D generation, sparse-frame-to-4D generation, and full-video reconstruction in a single model. It outputs synchronized RGB frames and pointmaps, visualized as 4D point clouds with cameras, and RGB-depth sequences.
  • Figure 2: Architecture comparison for joint RGB and geometry modeling. (a) Channel-wise and (b) spatial-wise concatenation feed RGB and XYZ into a single diffusion model with a shared LoRA branch. (c) Our Decoupled LoRA Control (DLC) employs two modality-specific LoRA branches with zero-initialized control links, achieving decoupled yet controlled RGB–XYZ joint generation. $\copyright$ denotes concatenation and $\oplus$ denotes pixel-wise addition.
  • Figure 3: Comparison of architectures for joint RGB–geometry generation. Our Decoupled LoRA Control produces cleaner RGB and sharper, more consistent XYZ and depth than channel-wise and spatial-wise concatenation, while channel-wise concatenation severely degrades both appearance and geometry.
  • Figure 4: Overview of the One4D framework. Unified Masked Conditioning (UMC) packs single-image, sparse-frame, and full-video inputs into a masked conditioning video. RGB and XYZ videos are encoded into latent spaces via video VAEs, and the conditioning latents are concatenated only with noisy RGB latents. These RGB and XYZ latents are then processed by a DiT backbone with Decoupled LoRA Control (DLC). DLC employs modality-specific LoRA branches to decouple computation, and zero-initialized cross-modal control links to learn pixel-wise consistency. The denoised RGB and XYZ latents are finally decoded into RGB frames and pointmaps.
  • Figure 5: Single-image-to-4D generation comparison between 4DNeX [chen20254dnex] and our One4D. Compared to 4DNeX, One4D produces more dynamic and realistic videos, sharper and cleaner depth, and more complete, coherent 4D point clouds with cameras.
  • ...and 2 more figures
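
The Decoupled LoRA Control idea described in the captions of Figures 2 and 4 (two modality-specific LoRA branches joined by zero-initialized control links that are combined by addition) can be sketched on a single toy linear layer. This is a minimal illustration under stated assumptions, not the paper's architecture. The class names, the dense link matrices, the bidirectional links, and their placement on each branch's output are all hypothetical. The LoRA update uses the usual initialization, with the up-projection set to zero. Plain Python lists keep the example dependency-free.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


class LoRALinear:
    """Frozen base weight plus a low-rank update: W x + B (A x)."""

    def __init__(self, base: Matrix, rank: int, rng: random.Random) -> None:
        dim_out, dim_in = len(base), len(base[0])
        self.base = base                              # frozen, shared weight
        self.down = random_matrix(rank, dim_in, rng)  # A: random init
        self.up = zeros(dim_out, rank)                # B: zero init

    def __call__(self, x: Vector) -> Vector:
        return add(matvec(self.base, x), matvec(self.up, matvec(self.down, x)))


class DecoupledBlock:
    """Two LoRA branches over one base weight, joined by zero-init links."""

    def __init__(self, dim: int, rank: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        base = random_matrix(dim, dim, rng)
        self.rgb = LoRALinear(base, rank, rng)  # RGB-specific adapter
        self.xyz = LoRALinear(base, rank, rng)  # pointmap-specific adapter
        # Assumption: one dense link per direction, both starting at zero.
        self.xyz_to_rgb = zeros(dim, dim)
        self.rgb_to_xyz = zeros(dim, dim)

    def __call__(self, rgb: Vector, xyz: Vector) -> Tuple[Vector, Vector]:
        rgb_hidden = self.rgb(rgb)
        xyz_hidden = self.xyz(xyz)
        # Element-wise addition of the cross-modal control signal.
        rgb_out = add(rgb_hidden, matvec(self.xyz_to_rgb, xyz_hidden))
        xyz_out = add(xyz_hidden, matvec(self.rgb_to_xyz, rgb_hidden))
        return rgb_out, xyz_out


block = DecoupledBlock(dim=4, rank=2)
rgb_token = [0.5, -1.0, 0.25, 2.0]
xyz_token = [1.0, 0.0, -0.5, 0.75]

# At initialization the links contribute nothing, so each branch
# reproduces the frozen base layer on its own input.
rgb_out, xyz_out = block(rgb_token, xyz_token)
base_rgb = matvec(block.rgb.base, rgb_token)
base_xyz = matvec(block.xyz.base, xyz_token)

# Simulate a link weight that has moved away from zero during training.
block.xyz_to_rgb[0][0] = 1.0
linked_rgb, linked_xyz = block(rgb_token, xyz_token)
```

Because both the LoRA up-projections and the control links start at zero, the block initially behaves exactly like the frozen base layer applied to each modality separately. Cross-modal influence appears only once the link weights move away from zero, which matches the captions' description of links that gradually learn consistency.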