LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors

Yabo Chen; Chen Yang; Jiemin Fang; Xiaopeng Zhang; Lingxi Xie; Wei Shen; Wenrui Dai; Hongkai Xiong; Qi Tian

TL;DR

LiftImage3D tackles single-image 3D reconstruction by leveraging latent video diffusion model priors while ensuring 3D-consistent outputs. It introduces an articulated trajectory generation strategy, a robust neural matching module for pose estimation, and a distortion-aware 3D Gaussian Splatting representation that decouples canonical geometry from frame distortions, augmented by depth-prior injection. The framework achieves state-of-the-art results on LLFF, DL3DV, and Tanks and Temples and generalizes to diverse in-the-wild inputs, including cartoons and complex real scenes. This work advances practical single-image to 3D synthesis by combining diffusion-based priors with explicit geometric calibration and distortion modeling. Its components offer a scalable, controllable pathway to convert a single image into a coherent 3D Gaussian-based scene for rendering and analysis.

Abstract

Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that unlocks LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. We then use a robust neural matching model, namely MASt3R, to calibrate the camera poses of the generated frames and produce the corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which learns independent distortions between frames and outputs undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on three challenging datasets, namely LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.
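As a rough illustration of the idea of decomposing a large camera motion into controllable small motions, the sketch below splits one large rotation into chained segments that each stay under a step limit. It is only a toy under stated assumptions: the function name `split_camera_motion`, the single-angle parameterization, and the equal-size split are hypothetical and are not taken from the paper, which does not specify its trajectory construction here.

```python
import math

def split_camera_motion(total_angle_deg, max_step_deg):
    """Split one large rotation into equal segments no larger than max_step_deg.

    Hypothetical helper for illustration only; not the paper's procedure.
    """
    if max_step_deg <= 0:
        raise ValueError("max_step_deg must be positive")
    # Smallest number of segments that keeps every segment within the limit.
    n_segments = max(1, math.ceil(abs(total_angle_deg) / max_step_deg))
    step = total_angle_deg / n_segments
    # Each segment is (start_angle, end_angle); a segment starts where the
    # previous one ended, so the small motions chain into the large one.
    return [(i * step, (i + 1) * step) for i in range(n_segments)]

# Example: a 60-degree sweep with at most 25 degrees per generated clip.
segments = split_camera_motion(60.0, 25.0)
```

In this toy setting, each segment would correspond to one short generated clip whose last frame seeds the next clip; how the actual framework chains its clips is not described in the abstract.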
