pixelNeRF: Neural Radiance Fields from One or Few Images

Alex Yu, Vickie Ye, Matthew Tancik, Angjoo Kanazawa

TL;DR

PixelNeRF addresses the limitation of NeRF's per-scene optimization by learning a scene prior that maps one or a few input images to a NeRF representation. It achieves this by conditioning a NeRF on pixel-aligned image features from a fully convolutional encoder, enabling feed-forward rendering without test-time optimization and supporting an arbitrary number of input views. The method operates in the input view's coordinate frame, which helps it generalize to unseen objects and real scenes; on ShapeNet and DTU it outperforms state-of-the-art few-shot baselines. Overall, pixelNeRF provides a data-driven, view-conditioned 3D reconstruction approach that is trained from images alone, with no explicit 3D supervision, advancing practical few-shot novel-view synthesis.

Abstract

We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields involves optimizing the representation for every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by applying it to multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website: https://alexyu.net/pixelnerf
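
The abstract credits NeRF's volume rendering for making image-only training possible. As a rough illustration of why, the sketch below alpha-composites per-sample densities and colors along a single ray into one pixel color, which can then be compared directly with a ground-truth pixel. It is a minimal pure-Python sketch of the standard NeRF quadrature, not the authors' implementation; the function names `composite_ray` and `photometric_loss`, the plain-list inputs, and the mean-squared-error loss are assumptions made for this example.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (hypothetical helper).

    densities: non-negative floats, one per sample along the ray
    colors:    (r, g, b) tuples, one per sample
    deltas:    spacing between adjacent samples, one per sample
    Returns the rendered pixel color and the remaining transmittance.
    """
    transmittance = 1.0            # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution of this sample
        for k in range(3):
            pixel[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

def photometric_loss(predicted, target):
    """Mean squared error between a rendered and a target pixel (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

Because every step is differentiable with respect to the densities and colors, a loss on rendered pixels can supervise the network that predicts them, which is why no 3D ground truth is required.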

Paper Structure

This paper contains 41 sections, 7 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: NeRF from one or few images. We present pixelNeRF, a learning framework that predicts a Neural Radiance Field (NeRF) representation from a single (top) or few posed images (bottom). PixelNeRF can be trained on a set of multi-view images, allowing it to generate plausible novel view synthesis from very few input images without test-time optimization (bottom left). In contrast, NeRF has no generalization capabilities and performs poorly when only three input views are available (bottom right).
  • Figure 2: Proposed architecture in the single-view case. For a query point $\mathbf{x}$ along a target camera ray with view direction $\mathbf{d}$, a corresponding image feature is extracted from the feature volume $\mathbf{W}$ via projection and interpolation. This feature is then passed into the NeRF network $f$ along with the spatial coordinates. The output RGB and density values are volume-rendered and compared with the target pixel value. The coordinates $\mathbf{x}$ and $\mathbf{d}$ are in the camera coordinate system of the input view.
  • Figure 3: Category-specific single-view reconstruction benchmark. We train a separate model for cars and chairs and compare to SRN. The corresponding numbers may be found in Table \ref{tab:single_cat}.
  • Figure 4: Category-specific $2$-view reconstruction benchmark. We provide two views (left) to each model, and show two novel view renderings in each case (right). Please also refer to Table \ref{tab:single_cat}.
  • Figure 5: Category-agnostic single-view reconstruction. Going beyond the SRN benchmark, we train a single model to the 13 largest ShapeNet categories; we find that our approach produces superior visual results compared to a series of strong baselines. In particular, the model recovers fine detail and thin structure more effectively, even for outlier shapes. Quite visibly, images on monitors and tabletop textures are accurately reproduced; baselines representing the scene as a single latent vector cannot preserve such details of the input image. SRN's test-time latent inversion becomes less reliable as well in this setting. The corresponding quantitative evaluations are available in Table \ref{tab:multi_cat}. Due to space constraints, we show objects with interesting properties here. Please see the supplemental for sampled results.
  • ...and 13 more figures
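
The Figure 2 caption describes extracting an image feature for a query point "via projection and interpolation". The sketch below illustrates one way that step could look: a camera-space point is projected onto the image plane with a pinhole model, and the feature grid is sampled bilinearly at the resulting sub-pixel location. This is an illustrative pure-Python sketch, not the authors' code. The helper names (`project`, `bilinear_sample`, `pixel_aligned_feature`), the pinhole parameters `focal`, `cx`, `cy`, the assumption that the camera looks along +z, the nested-list feature layout, and the clamp-to-border behavior are all assumptions made for this example.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) with z > 0.

    Returns continuous pixel coordinates (u, v) = (column, row).
    Assumes the camera looks along +z (an illustrative convention).
    """
    x, y, z = point
    return focal * x / z + cx, focal * y / z + cy

def bilinear_sample(feature_map, u, v):
    """Bilinearly interpolate feature_map[row][col] (a list of channels) at (u, v).

    Coordinates outside the grid are clamped to the border (assumed behavior).
    """
    height, width = len(feature_map), len(feature_map[0])
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    c0, r0 = int(u), int(v)                       # top-left neighbor
    c1, r1 = min(c0 + 1, width - 1), min(r0 + 1, height - 1)
    fu, fv = u - c0, v - r0                       # fractional offsets
    out = []
    for k in range(len(feature_map[0][0])):
        top = (1 - fu) * feature_map[r0][c0][k] + fu * feature_map[r0][c1][k]
        bottom = (1 - fu) * feature_map[r1][c0][k] + fu * feature_map[r1][c1][k]
        out.append((1 - fv) * top + fv * bottom)
    return out

def pixel_aligned_feature(feature_map, point, focal, cx, cy):
    """Feature vector for a 3D query point given in the input camera's frame."""
    u, v = project(point, focal, cx, cy)
    return bilinear_sample(feature_map, u, v)
```

In the setup the caption describes, the sampled feature vector would be passed to the NeRF network $f$ together with $\mathbf{x}$ and $\mathbf{d}$; that network is outside the scope of this sketch.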