LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations

Tung Do, Thuan Hoang Nguyen, Anh Tuan Tran, Rang Nguyen, Binh-Son Hua

TL;DR

LiftRefine tackles single- and few-view novel view synthesis by first lifting the inputs into a coarse volumetric radiance field and a high-resolution tri-plane, then refining the renderings with a latent diffusion model conditioned on Stage-1 features. A progressive inference scheme alternates reconstruction and diffusion to fill in unseen regions step by step while maintaining view consistency, achieving state-of-the-art results on CO3D, Google Scanned Objects, and Objaverse. The approach balances memory efficiency and detail by combining volume and tri-plane representations with diffusion-based hallucination, which yields plausible detail in occluded regions and multi-view-consistent novel views from limited input views.

Abstract

We propose a new view synthesis method that synthesizes a 3D neural field from single or few-view input images. To address the ill-posed nature of the image-to-3D generation problem, we devise a two-stage method that involves a reconstruction model and a diffusion model for view synthesis. Our reconstruction model first lifts one or more input images into 3D space, using a volume as the coarse-scale 3D representation followed by a tri-plane as the fine-scale 3D representation. To mitigate the ambiguity in occluded regions, our diffusion model then hallucinates missing details in the images rendered from the tri-planes. We then introduce a new progressive refinement technique that iteratively applies the reconstruction and diffusion models to gradually synthesize novel views, boosting the overall quality of the 3D representations and their rendering. Empirical evaluation demonstrates the superiority of our method over state-of-the-art methods on the synthetic SRN-Car dataset, the in-the-wild CO3D dataset, and the large-scale Objaverse dataset while achieving both sampling efficacy and multi-view consistency.
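The progressive refinement idea described in the abstract can be illustrated with a minimal control-flow sketch. This is one plausible reading of the loop, not the authors' implementation: the function names (`progressive_refine`, `reconstruct`, `render`, `refine`), the use of azimuth angles as camera poses, and the choice to feed each refined view back as an extra input are all assumptions made for illustration, and the toy stand-ins below only manipulate strings.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical view type: (image placeholder, camera azimuth in degrees).
View = Tuple[Any, float]


def progressive_refine(
    input_views: Sequence[View],
    target_azimuths: Sequence[float],
    reconstruct: Callable[[List[View]], Any],
    render: Callable[[Any, float], Any],
    refine: Callable[[Any], Any],
) -> Tuple[Any, List[View]]:
    """Alternate reconstruction and refinement over a list of target poses.

    Assumption: each refined rendering is appended to the view set, so the
    next reconstruction sees it as an additional input.
    """
    views = list(input_views)
    for azimuth in target_azimuths:
        field = reconstruct(views)        # Stage 1: lift current views to 3D
        coarse = render(field, azimuth)   # render the new pose from the 3D field
        refined = refine(coarse)          # Stage 2: diffusion-style refinement
        views.append((refined, azimuth))  # feed the refined view back in
    return reconstruct(views), views


# Toy stand-ins (assumptions) so the control flow can be run end to end.
def toy_reconstruct(views: List[View]) -> dict:
    return {"num_views": len(views)}


def toy_render(field: dict, azimuth: float) -> str:
    return f"coarse(az={azimuth}, from {field['num_views']} views)"


def toy_refine(image: str) -> str:
    return f"refined[{image}]"


final_field, all_views = progressive_refine(
    [("input", 0.0)], [90.0, 180.0, 270.0],
    toy_reconstruct, toy_render, toy_refine,
)
```

In this sketch each later pose is rendered from a field built on more views than the previous one, which is the sense in which the 3D representation is refined progressively.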

Paper Structure

This paper contains 17 sections, 3 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Our novel view synthesis method addresses both single-view and few-view settings with high-quality reconstruction and rendering. Previous methods like OpenLRM and SplatterImage struggle to accurately reconstruct occluded regions, whereas our method can generate plausible results. For few-view reconstruction, LaRa experiences a rapid decline in performance as the number of input views decreases. In contrast, our method consistently delivers faithful reconstructions across a wide range of input views.
  • Figure 2: Our Stage 1 involves a reconstruction model to lift the input to 3D representations. Our model supports both single-view and few-view reconstruction, where all input features are aggregated into the volume decoder. The volume is then transformed into a triplane for rendering to novel view images and feature maps.
  • Figure 3: Our Stage 2 involves a conditional rendering diffusion model that aims to refine the rendered novel view from Stage 1 with additional details from a latent diffusion model.
  • Figure 4: Progressive inference. Our method reconstructs and generates intermediate views, progressively refining the quality of the 3D representation and its rendering.
  • Figure 5: Qualitative results on CO3D Dataset.
  • ...and 16 more figures