ShaRF: Shape-conditioned Radiance Fields from a Single View
Konstantinos Rematas, Ricardo Martin-Brualla, Vittorio Ferrari
TL;DR
ShaRF introduces a shape-conditioned radiance field that reconstructs a neural scene representation of an object from a single image. A shape network G maps a latent code to a voxelized geometric scaffold, which guides an appearance network F that produces the radiance field under the control of a second latent code. Because shape and appearance are disentangled, the latent codes and the networks can be optimized at test time, in two stages, to fit a single image, which enables geometrically consistent novel-view synthesis. Using symmetry and projection losses alongside the explicit geometry, ShaRF achieves strong performance on ShapeNet-SRN, generalizes to ShapeNet-Realistic and Pix3D, and provides competitive 3D shape reconstructions. In practice, this enables single-image 3D inference with plausible geometry and appearance that adapts to real-world data without requiring multiple views at inference time.
Abstract
We present a method for estimating neural scene representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape, and then renders it to an image, with the object appearance being controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views that are geometrically consistent and faithfully represent the input object. Additionally, our method generalizes to images outside of the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. We demonstrate the effectiveness of our approach in several experiments on both synthetic and real images.
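The inference procedure described above (fit the latent codes to a test image, then also fine-tune the networks) can be illustrated with a toy sketch. This is not the authors' implementation: a single linear map stands in for the shape and appearance networks G and F, one latent code stands in for both codes, and a mean squared error stands in for the rendering loss. The names `decode`, `fit_single_image`, `steps_codes`, and `steps_joint` are hypothetical, and the stage split is one plausible reading of the two-stage optimization.

```python
import random

def decode(weights, code):
    """Toy linear stand-in for the networks: latent code -> flat list of pixels."""
    return [sum(w * c for w, c in zip(row, code)) for row in weights]

def loss(weights, code, target):
    """Mean squared error between the decoded image and the target image."""
    pred = decode(weights, code)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def fit_single_image(weights, target, code_dim,
                     steps_codes=200, steps_joint=200, lr=0.05, seed=0):
    """Assumed two-stage fit: codes only first, then codes and weights jointly."""
    rng = random.Random(seed)
    code = [rng.uniform(-0.1, 0.1) for _ in range(code_dim)]
    weights = [row[:] for row in weights]  # copy so the caller's weights stay intact
    n = len(target)
    history = []
    for step in range(steps_codes + steps_joint):
        pred = decode(weights, code)
        # Gradient of the loss with respect to each predicted pixel.
        residual = [2.0 * (p - t) / n for p, t in zip(pred, target)]
        # Gradient with respect to the latent code (computed before any weight update).
        grad_code = [sum(residual[i] * weights[i][j] for i in range(n))
                     for j in range(code_dim)]
        if step >= steps_codes:
            # Stage 2 (assumed): also fine-tune the "network" weights.
            for i in range(n):
                for j in range(code_dim):
                    weights[i][j] -= lr * residual[i] * code[j]
        code = [c - lr * g for c, g in zip(code, grad_code)]
        history.append(loss(weights, code, target))
    return code, weights, history

# Usage: a 4-pixel "image" that the frozen 2-D decoder cannot reproduce exactly.
pretrained = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
test_image = [0.2, 0.9, 0.4, 0.7]
code, tuned, history = fit_single_image(pretrained, test_image, code_dim=2)
print("loss after codes-only stage:", history[199])
print("loss after joint stage:", history[-1])
```

In this toy setting the codes-only stage can only reach the best fit available to the frozen decoder, while the joint stage lowers the loss further by adapting the weights to the single test image, which mirrors the motivation for fine-tuning the networks at inference time.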
