LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations
Tung Do, Thuan Hoang Nguyen, Anh Tuan Tran, Rang Nguyen, Binh-Son Hua
TL;DR
LiftRefine tackles single- and few-view novel view synthesis in two stages. A reconstruction model first lifts the inputs into a coarse volumetric radiance field and a high-resolution tri-plane; a latent diffusion model conditioned on Stage-1 features then refines the rendered views. At inference, a progressive scheme alternates reconstruction and diffusion to fill in unseen regions while maintaining view consistency, achieving state-of-the-art results on CO3D, Google Scanned Objects, and Objaverse. Fusing the volume and tri-plane representations balances memory cost against detail, and diffusion-based hallucination supplies plausible content for occluded regions, yielding multi-view-consistent novel views from limited input.
Abstract
We propose a new view synthesis method that synthesizes a 3D neural field from either a single input image or a few input views. To address the ill-posed nature of the image-to-3D generation problem, we devise a two-stage method that combines a reconstruction model and a diffusion model. Our reconstruction model first lifts one or more input images into 3D space, using a volume as the coarse-scale 3D representation followed by a tri-plane as the fine-scale 3D representation. To mitigate the ambiguity in occluded regions, our diffusion model then hallucinates missing details in the images rendered from the tri-planes. We further introduce a new progressive refinement technique that iteratively applies the reconstruction and diffusion models to gradually synthesize novel views, improving the quality of both the 3D representations and their renderings. Empirical evaluation demonstrates the superiority of our method over state-of-the-art methods on the synthetic SRN-Car dataset, the in-the-wild CO3D dataset, and the large-scale Objaverse dataset, while achieving both sampling efficacy and multi-view consistency.
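The progressive refinement loop described in the abstract can be illustrated with a minimal sketch. This is one plausible reading of the abstract and not the authors' implementation. The names `View`, `progressive_refine`, and `views_per_round`, the order in which target poses are visited, and the toy stand-in models are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class View:
    pose: float   # hypothetical camera azimuth in degrees
    image: float  # placeholder for pixel data (a real image would be an array)


def progressive_refine(
    inputs: List[View],
    target_poses: List[float],
    reconstruct: Callable[[List[View]], float],
    render: Callable[[float, float], float],
    refine: Callable[[float], float],
    views_per_round: int = 2,
) -> Tuple[float, List[View]]:
    """Alternate reconstruction and diffusion-style refinement (illustrative only)."""
    views = list(inputs)
    pending = list(target_poses)
    while pending:
        # Stage 1 (assumed interface): lift all views gathered so far to a 3D field.
        field = reconstruct(views)
        batch, pending = pending[:views_per_round], pending[views_per_round:]
        for pose in batch:
            coarse = render(field, pose)  # render a novel view from the field
            # Stage 2 (assumed interface): refine the render, then reuse it as input.
            views.append(View(pose, refine(coarse)))
    # Final reconstruction from the input views plus all refined views.
    return reconstruct(views), views


# Toy stand-ins so the sketch runs; real models would be neural networks.
def toy_reconstruct(views: List[View]) -> float:
    return sum(v.image for v in views) / len(views)


def toy_render(field: float, pose: float) -> float:
    return field


def toy_refine(image: float) -> float:
    return image


final_field, all_views = progressive_refine(
    [View(0.0, 1.0)], [90.0, 180.0, 270.0],
    toy_reconstruct, toy_render, toy_refine,
)
```

In this sketch every refined view is fed back as an additional input, so later rounds reconstruct from more views than earlier ones. How many views to add per round and which poses to visit first are design choices that the abstract does not specify.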
