LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors

Yabo Chen, Chen Yang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, Wei Shen, Wenrui Dai, Hongkai Xiong, Qi Tian

TL;DR

LiftImage3D tackles single-image 3D reconstruction by leveraging the priors of latent video diffusion models while keeping the outputs 3D-consistent. It introduces an articulated trajectory generation strategy, a robust neural matching module for pose estimation, and a distortion-aware 3D Gaussian Splatting representation that decouples canonical geometry from per-frame distortions, augmented by depth-prior injection. The framework achieves state-of-the-art results on LLFF, DL3DV, and Tanks and Temples and generalizes to diverse in-the-wild inputs, including cartoons and complex real scenes. The work advances practical single-image-to-3D synthesis by combining diffusion-based priors with explicit geometric calibration and distortion modeling. Together, its components offer a controllable way to convert a single image into a coherent 3D Gaussian scene for rendering and analysis.

Abstract

Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) quality degradation under large camera motions, (2) difficulty in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that harnesses LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy for generating video frames, which decomposes video sequences with large camera motions into sequences with small, controllable motions. We then use a robust neural matching model, namely MASt3R, to calibrate the camera poses of the generated frames and produce the corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which learns the independent distortions of individual frames and outputs undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on three challenging datasets, namely LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.
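
To make the articulated trajectory idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the camera motion is a single yaw sweep measured in degrees, and the function name split_trajectory and its parameters (total_deg, max_clip_deg, frames_per_clip) are hypothetical. It only illustrates how one large motion can be split into small-motion clips, each starting at the pose where the previous clip ended.

```python
import math


def split_trajectory(total_deg, max_clip_deg, frames_per_clip):
    """Split one large yaw sweep into clips that each cover a small motion.

    All names and units here are illustrative assumptions, not taken from
    the paper. Returns a list of clips; each clip is a list of yaw angles.
    """
    if max_clip_deg <= 0:
        raise ValueError("max_clip_deg must be positive")
    if frames_per_clip < 2:
        raise ValueError("frames_per_clip must be at least 2")

    # Number of clips needed so that no clip exceeds max_clip_deg.
    n_clips = max(1, math.ceil(abs(total_deg) / max_clip_deg))
    step = total_deg / n_clips  # signed motion covered by one clip

    clips = []
    for k in range(n_clips):
        start = k * step
        # Evenly spaced poses; the last pose of clip k equals the first
        # pose of clip k + 1, so the terminal frame seeds the next clip.
        clip = [start + step * i / (frames_per_clip - 1)
                for i in range(frames_per_clip)]
        clips.append(clip)
    return clips


clips = split_trajectory(total_deg=60.0, max_clip_deg=20.0, frames_per_clip=5)
```

In this toy setting, a 60-degree sweep with a 20-degree limit yields three clips, and the video model would only ever be asked to produce a motion of at most 20 degrees per clip.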

Paper Structure

This paper contains 23 sections, 12 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: LiftImage3D is a universal framework that utilizes video generation priors to lift any single 2D image into 3D Gaussians, capable of handling in-the-wild 3D objects/scenes.
  • Figure 2: The overall pipeline of LiftImage3D. We first extend LVDM to generate multiple video clips from a single image using an articulated camera trajectory strategy. Then all generated frames are matched using the robust neural matching module and registered into a point cloud. After that, we initialize Gaussians from registered point clouds and construct a distortion field to model the independent distortion of each video frame upon canonical 3DGS.
  • Figure 3: Articulated trajectory generation pipeline. The gray cameras indicate previously generated frames, and the orange cameras show the current generation sequence. The process iteratively generates frames following a predefined trajectory (orange gradient), where each subsequent generation uses the terminal frame from the previous sequence as its input. This cascading approach enables comprehensive object coverage through controlled camera trajectories.
  • Figure 4: Overall qualitative results of our method compared with AdaMPI, SinMPI, LucidDreamer, and ViewCrafter.
  • Figure 5: Visualization of the proposed depth prior injection. The first column shows the video frames generated by the LVDM or the input images. The second column shows the monocular depth estimated by Depth Anything v2. The third column shows the coarse, scaled depth maps predicted by MASt3R. The fourth column shows the calibrated result, which provides fine depth estimates with scale and illustrates how the depth prior injection module yields accurate, finely detailed depth priors.
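
The Figure 5 caption describes calibrating a fine but scale-free monocular depth map against a coarse depth map that carries scale. The exact calibration used in the paper is not given in this section, so the sketch below swaps in a common stand-in: a global least-squares scale-and-shift fit. The function name align_depth, the valid mask, and the assumption that both maps are expressed in the same depth domain (not disparity) are illustrative choices, not details from the paper.

```python
import numpy as np


def align_depth(mono, coarse, valid=None):
    """Fit scale and shift so that scale * mono + shift approximates coarse.

    A generic affine alignment used for illustration only; it is not
    claimed to be the paper's depth prior injection module.
    """
    mono = np.asarray(mono, dtype=np.float64)
    coarse = np.asarray(coarse, dtype=np.float64)
    if valid is None:
        # By default, use every pixel where both maps are finite.
        valid = np.isfinite(mono) & np.isfinite(coarse)

    x = mono[valid]
    y = coarse[valid]
    # Solve min over (scale, shift) of ||scale * x + shift - y||^2.
    design = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, y, rcond=None)

    # Apply the fit to the full-resolution monocular map.
    return scale * mono + shift, scale, shift


# Synthetic check: the coarse map is an exact affine function of mono.
rng = np.random.default_rng(0)
mono_depth = rng.uniform(0.1, 1.0, size=(4, 6))
coarse_depth = 2.5 * mono_depth + 0.3
aligned, scale, shift = align_depth(mono_depth, coarse_depth)
```

The aligned map keeps the fine structure of the monocular estimate while adopting the scale of the coarse one. A per-region or robust fit would be a natural refinement when the coarse depth is noisy.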