NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model

Jinpeng Liu, Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Ying Shan, Yansong Tang

TL;DR

Experimental results on publicly available datasets indicate that NovelGS substantially surpasses existing image-to-3D frameworks, both qualitatively and quantitatively, and demonstrates state-of-the-art performance in addressing the multi-view image reconstruction challenge.

Abstract

We introduce NovelGS, a diffusion model for Gaussian Splatting (GS) given sparse-view images. Recent works leverage feed-forward networks to generate pixel-aligned Gaussians, which can be rendered quickly. Unfortunately, because of this formulation, these methods cannot produce satisfactory results for areas not covered by the input images. In contrast, we leverage novel-view denoising through a transformer-based network to generate 3D Gaussians. Specifically, by incorporating both conditional views and noisy target views, the network predicts pixel-aligned Gaussians for each view. During training, the rendered target views and some additional views of the Gaussians are supervised. During inference, the target views are iteratively rendered and denoised, starting from pure noise. Our approach demonstrates state-of-the-art performance on the multi-view image reconstruction task. Owing to its generative modeling of unseen regions, NovelGS effectively reconstructs 3D objects with consistent and sharp textures. Experimental results on publicly available datasets indicate that NovelGS substantially surpasses existing image-to-3D frameworks, both qualitatively and quantitatively. We also demonstrate the potential of NovelGS in generative tasks, such as text-to-3D and image-to-3D, by integrating it with existing multiview diffusion models. We will make the code publicly accessible.
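The inference procedure described in the abstract (target views start as pure noise, are rendered from the predicted Gaussians, and are then denoised iteratively) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `toy_predict_and_render` is a hypothetical stand-in for the transformer denoiser plus Gaussian renderer, the linear noise schedule in `add_noise` is an assumption, and re-noising the rendered estimate to the previous timestep is one reading of the description.

```python
import numpy as np

def add_noise(x0, t, T, rng):
    """Noise a clean estimate x0 to level t (assumed linear schedule)."""
    alpha = 1.0 - t / T  # alpha = 1 at t = 0 (clean), alpha = 0 at t = T (pure noise)
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * noise

def toy_predict_and_render(cond_views, noisy_target, t, T):
    """Hypothetical stand-in for 'predict Gaussians, then render the target view'.

    A real model would run a network here; this toy version just blends the
    mean of the clean conditional views with the noisy target view.
    """
    w = t / T  # rely more on the clean views when the target is very noisy
    return w * cond_views.mean(axis=0) + (1.0 - w) * noisy_target

def sample_target_view(cond_views, T=10, seed=0):
    """Iteratively denoise one target view, starting from pure noise."""
    rng = np.random.default_rng(seed)
    x_t = rng.standard_normal(cond_views.shape[1:])  # target view at timestep T
    for t in range(T, 0, -1):
        x0_hat = toy_predict_and_render(cond_views, x_t, t, T)  # rendered estimate
        x_t = add_noise(x0_hat, t - 1, T, rng)  # re-noise to timestep t - 1
    return x_t  # at t = 0 no noise is added, so this is the final estimate

if __name__ == "__main__":
    cond = np.ones((4, 8, 8, 3))  # four clean views of size 8x8 RGB
    print(sample_target_view(cond, T=5).shape)
```

In the final iteration the estimate is re-noised to timestep 0, which under this schedule adds no noise, so the loop returns the last rendered estimate unchanged.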

Paper Structure

This paper contains 19 sections, 6 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Comparison of pixel-aligned Gaussian reconstruction models and NovelGS. (a) Most existing models [lgm, grm, gs-lrm] translate the input pixels into pixel-aligned Gaussians [grm] based on camera rays. (b) In contrast, we propose to denoise novel-view images via the large Gaussian reconstruction model, so that the unseen parts of the objects can be reconstructed consistently.
  • Figure 2: High-fidelity 3D assets produced by NovelGS. NovelGS is designed for sparse-view reconstruction and operates in conjunction with complementary tools, including text-to-image generation [stable-diffusion] and image-to-multiview modeling [zero123plus]. This combined framework supports text-to-3D generation (bottom) and image-to-3D generation (center), as well as the reconstruction of real-world objects (top).
  • Figure 3: Pipeline of the NovelGS model. We utilize a large transformer-based network to denoise noisy view images for 3D reconstruction. During inference, we initialize the target views with pure noise. We then concatenate the camera ray embeddings (Plücker rays) with the images as the input (two clean views and one noisy view are shown in the figure to reduce clutter; the main experiments use four clean views and one noisy view). Next, we use the denoiser to predict the Gaussians and render the image from the noisy view. After that, we add noise to the noisy-view images up to timestep T-1. We repeat this process until we obtain the final 3D Gaussians. During training, we add noise to the noisy-view images according to the timestep and use the denoiser to predict 3D Gaussians. We train the denoiser module with a rendering loss.
  • Figure 4: Visual comparisons to previous methods. The four-view input images are displayed in the leftmost column, while novel view renderings are compared on the right. Previous methods struggle to reconstruct high-frequency details and thin structures consistently. In contrast, our NovelGS demonstrates significantly improved performance in these scenarios. The PSNRs are provided beneath each image.
  • Figure 5: Qualitative results of different numbers of views. Setting 1: 1 clean view and 1 noisy view. Setting 2: 2 clean views and 1 noisy view. Setting 3: 3 clean views and 1 noisy view. Setting 4: 4 clean views and 2 noisy views. Setting 5: 4 clean views.
  • ...and 5 more figures
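
The Figure 3 caption mentions concatenating a Plücker ray embedding with each input image. As a rough illustration, the sketch below computes per-pixel Plücker coordinates for a pinhole camera and stacks them onto an RGB image channel-wise. The function name `plucker_rays`, the OpenCV-style camera convention (x right, y down, z forward), the pixel-center offset, and the channel ordering (moment `o × d` first, then direction `d`) are all assumptions made for this example; the paper's exact convention is not stated in the text here.

```python
import numpy as np

def plucker_rays(K, c2w, H, W):
    """Per-pixel Plücker coordinates (o x d, d) for a pinhole camera.

    K:   3x3 intrinsics matrix (assumed OpenCV-style).
    c2w: 4x4 camera-to-world transform.
    Returns an array of shape (H, W, 6).
    """
    # Pixel-center coordinates; default 'xy' indexing yields (H, W) grids.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)       # (H, W, 3)
    dirs_cam = pix @ np.linalg.inv(K).T                    # back-project pixels
    dirs_world = dirs_cam @ c2w[:3, :3].T                  # rotate into world frame
    d = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    o = np.broadcast_to(c2w[:3, 3], d.shape)               # shared ray origin
    moment = np.cross(o, d)                                # o x d
    return np.concatenate([moment, d], axis=-1)

if __name__ == "__main__":
    H = W = 8
    K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]])
    c2w = np.eye(4)
    c2w[:3, 3] = [1.0, 2.0, 3.0]
    rays = plucker_rays(K, c2w, H, W)
    image = np.zeros((H, W, 3))
    net_input = np.concatenate([image, rays], axis=-1)     # 3 RGB + 6 ray channels
    print(net_input.shape)
```

Two properties make this a convenient sanity check: the direction channels have unit norm, and the moment is orthogonal to the direction at every pixel.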