A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction

Jianghao Shen, Nan Xue, Tianfu Wu

TL;DR

This work tackles single-view 3D reconstruction by augmenting the existing Splatter Image with a hierarchical per-pixel representation: each pixel has a parent Gaussian plus a small set of child Gaussians whose parameters are predicted by lightweight MLPs conditioned on the parent features and the target view. The method leverages world-coordinate target-view conditioning to guide view-specific refinements, enabling better recovery of occluded content while keeping computation largely on par with prior approaches. Empirically, the approach achieves state-of-the-art results on ShapeNet-SRN and CO3D across Cars, Chairs, Hydrants, and Teddybears, with ablations showing the necessity of target-view conditioning and the robustness of the design. This hierarchical, view-aware refinement yields more faithful novel view synthesis from a single image, advancing practical single-view 3D reconstruction for real-time rendering and scene understanding.

Abstract

Learning a 3D scene representation from a single-view image is a long-standing, fundamental problem in computer vision, made hard by the inherent ambiguity in predicting content unseen from the input view. Built on the recently proposed 3D Gaussian Splatting (3DGS), the Splatter Image method has made promising progress on fast single-image novel view synthesis by learning a single 3D Gaussian for each pixel from the U-Net feature map of an input image. However, it has limited expressive power to represent occluded components that are not observable in the input view. To address this problem, this paper presents a Hierarchical Splatter Image method in which a pixel is worth more than one 3D Gaussian. Specifically, each pixel is represented by a parent 3D Gaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are learned as in the vanilla Splatter Image. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron (MLP) that takes as input the projected image features of a parent 3D Gaussian and the embedding of a target camera view. Both parent and child 3D Gaussians are learned end-to-end in a stage-wise way. Jointly conditioning on the input image features as seen by the parent Gaussians and on the target camera position helps the model learn to allocate child Gaussians to "see the unseen", recovering occluded details that are often missed by parent Gaussians. In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D datasets and obtains state-of-the-art performance, showing in particular a promising ability to reconstruct content that is occluded in the input view.
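To make the parent-child idea in the abstract concrete, the following is a minimal NumPy sketch of how a lightweight MLP could map each parent's image feature plus a target-view embedding to a few child Gaussians. It is an illustration under stated assumptions, not the authors' implementation: the 14-value parameter layout, the two-layer ReLU MLP, the choice to place children as offsets from the parent centre, and the names `init_mlp` and `predict_children` are all hypothetical.

```python
import numpy as np

# Hypothetical layout of one child Gaussian: 3 offset + 3 log-scale
# + 4 quaternion + 1 opacity logit + 3 colour logits.
# The paper's actual parameterisation may differ.
PARAMS_PER_GAUSSIAN = 14


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Random weights for a two-layer MLP (sizes are illustrative)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict_children(mlp, parent_feats, parent_means, view_emb, num_children):
    """Predict child Gaussians for every pixel.

    parent_feats: (N, F) image feature attached to each parent Gaussian
    parent_means: (N, 3) parent centres
    view_emb:     (E,)   embedding of the target camera view
    Returns a dict of arrays with leading shape (N, num_children).
    """
    n = parent_feats.shape[0]
    # Every pixel sees the same target-view embedding.
    view = np.broadcast_to(view_emb, (n, view_emb.shape[0]))
    x = np.concatenate([parent_feats, view], axis=1)
    h = np.maximum(x @ mlp["w1"] + mlp["b1"], 0.0)  # ReLU
    out = h @ mlp["w2"] + mlp["b2"]
    out = out.reshape(n, num_children, PARAMS_PER_GAUSSIAN)
    offset, log_scale, quat, opacity, rgb = np.split(out, [3, 6, 10, 11], axis=-1)
    # Normalise quaternions so they encode valid rotations.
    quat = quat / (np.linalg.norm(quat, axis=-1, keepdims=True) + 1e-8)
    return {
        # Assumption: children are placed relative to their parent centre.
        "means": parent_means[:, None, :] + offset,
        "scales": np.exp(log_scale),
        "rotations": quat,
        "opacities": sigmoid(opacity),
        "colors": sigmoid(rgb),
    }
```

Because the view embedding is part of the MLP input, the same parent features yield different child Gaussians for different target views, which is the mechanism the abstract credits for recovering occluded content.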

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Illustration of the proposed Hierarchical Splatter Image in comparison with the vanilla Splatter Image [szymanowicz2023splatter]. The former is built on the latter. The main difference lies in their answers to the question of how many 3D Gaussians a pixel is worth when learning 3D reconstruction from a single-view image. Our proposed method generalizes the one-to-one pixel-to-3D-Gaussian mapping used in the vanilla Splatter Image to a one-to-many mapping with a two-layer hierarchical parent-child 3D Gaussian representation. We show the proposed method can sensibly recover occluded parts (e.g., chair legs). The images are from the ShapeNet-SRN dataset [chang2015shapenet]. See text for details.
  • Figure 2: Illustration of the proposed Hierarchical Splatter Image, which is built on, and aims to address two issues of, the vanilla Splatter Image [szymanowicz2023splatter]. During training, we estimate the parameters of the image encoder, the parent 3D Gaussian regressor, and the MLPs using the rendering loss between the ground-truth images and the images rendered by 3DGS from the learned parent-child 3D Gaussians. During inference, given an input view of an object instance unseen during training (e.g., the input in Fig. 1), we first compute the parent Gaussians. Then, based on a target view of interest, we compute the child Gaussians via the MLPs. With all the 3D Gaussians computed, we synthesize the image via 3DGS rendering. In comparison, the vanilla Splatter Image computes the parent Gaussians only and uses them to render images for any target view. Our method entails executing the MLPs on top of the parent Gaussians for each target view. We show this overhead is negligible in terms of FLOPs. See text for details.
  • Figure 3: Qualitative comparisons between the vanilla Splatter Image and our proposed Hierarchical Splatter Image on Chairs and Cars in the ShapeNet-SRN test dataset.
  • Figure 4: Visualization comparing renders that use only parent Gaussians with full renders that use both parent and child Gaussians.
  • Figure 5: Qualitative comparisons between the vanilla Splatter Image and our proposed Hierarchical Splatter Image on Hydrants and Teddy Bears in the CO3D test dataset.
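
The inference flow described in the Figure 2 caption (parent Gaussians computed once per input image, child Gaussians recomputed for each target view) can be sketched as follows. This is a hedged illustration only: the class `HierarchicalSplatterInference`, the packed 14-value Gaussian format, and the stub networks returned by `make_stub_networks` are hypothetical stand-ins for the paper's encoder and MLPs, and the final 3DGS rasterisation step is omitted.

```python
import numpy as np

PARAM_DIM = 14  # hypothetical packed size of one Gaussian


class HierarchicalSplatterInference:
    """Caches parent Gaussians; recomputes children per target view."""

    def __init__(self, encode_parents, predict_children):
        self.encode_parents = encode_parents
        self.predict_children = predict_children
        self.parents = None

    def set_input_view(self, image):
        # Image encoder + parent regressor: run once per input image.
        self.parents = self.encode_parents(image)

    def gaussians_for(self, target_view):
        if self.parents is None:
            raise RuntimeError("call set_input_view() first")
        # Lightweight per-view step: only the child predictor runs here.
        children = self.predict_children(self.parents, target_view)
        # The renderer would receive parents and children together.
        return np.concatenate([self.parents, children], axis=0)


def make_stub_networks(num_children, seed=0):
    """Toy stand-ins that only mimic shapes and count calls."""
    rng = np.random.default_rng(seed)
    calls = {"encoder": 0, "mlp": 0}

    def encode_parents(image):
        calls["encoder"] += 1
        h, w, _ = image.shape
        return rng.normal(size=(h * w, PARAM_DIM))  # one parent per pixel

    def predict_children(parents, target_view):
        calls["mlp"] += 1
        shift = np.tanh(target_view.sum())  # toy view-dependent change
        return np.repeat(parents, num_children, axis=0) + shift

    return encode_parents, predict_children, calls
```

With `num_children = K`, each target view is rendered from `H * W * (1 + K)` Gaussians, while the encoder runs only once regardless of how many target views are requested.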