GRF: Learning a General Radiance Field for 3D Representation and Rendering

Alex Trevithick, Bo Yang

TL;DR

GRF addresses the challenge of general 3D representation and novel-view synthesis from only 2D observations by learning a general radiance field that generalizes across unseen objects, categories, and scenes. It combines per-pixel 2D features with geometry-aware reprojection and attention-based aggregation, feeding aggregated 3D-point features into a NeRF-like renderer to produce high-fidelity views. The approach achieves strong generalization on ShapeNet and Synthetic-NeRF datasets, and substantially improves single-scene results on real-world LLFF/3DScan data compared with NeRF-based methods. The work also provides insights into the role of attention in resolving occlusions and view-aggregation in neural rendering.

Abstract

We present a simple yet powerful neural network that implicitly represents and renders 3D objects and scenes only from 2D observations. The network models 3D geometries as a general radiance field, which takes a set of 2D images with camera poses and intrinsics as input, constructs an internal representation for each point of the 3D space, and then renders the corresponding appearance and geometry of that point viewed from an arbitrary position. The key to our approach is to learn local features for each pixel in 2D images and to then project these features to 3D points, thus yielding general and rich point representations. We additionally integrate an attention mechanism to aggregate pixel features from multiple 2D views, such that visual occlusions are implicitly taken into account. Extensive experiments demonstrate that our method can generate high-quality and realistic novel views for novel objects, unseen categories and challenging real-world scenes.
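The two ideas in the abstract, projecting per-pixel features to a 3D point and aggregating them across views with attention, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the camera convention (`x_cam = R @ p + t` with pinhole intrinsics `K`), the nearest-pixel lookup, the function names, and the learned matrix `W` are all hypothetical, and the softmax pooling is a generic attention pooling rather than necessarily the module used in the paper.

```python
import numpy as np

def project_point(p_world, K, R, t):
    """Project a 3D world point into one view.

    Assumed convention: x_cam = R @ p_world + t, pinhole intrinsics K.
    Returns the (u, v) pixel coordinates and the depth in the camera frame.
    """
    p_cam = R @ p_world + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2], p_cam[2]

def gather_features(p_world, feature_maps, Ks, Rs, ts):
    """Look up the per-pixel feature that p projects onto in each of M views.

    Uses a nearest-pixel lookup clamped to the image bounds; points behind
    a camera or outside the image are not handled specially in this sketch.
    """
    feats = []
    for fmap, K, R, t in zip(feature_maps, Ks, Rs, ts):
        (u, v), _depth = project_point(p_world, K, R, t)
        h, w, _ = fmap.shape
        col = int(np.clip(np.rint(u), 0, w - 1))
        row = int(np.clip(np.rint(v), 0, h - 1))
        feats.append(fmap[row, col])
    return np.stack(feats)  # shape (M, C)

def attention_aggregate(feats, W):
    """Softmax attention pooling over the view axis.

    W is a hypothetical learned (C, C) matrix producing per-channel scores.
    The result is a per-channel convex combination of the M view features,
    so it does not depend on the order of the input views.
    """
    scores = feats @ W                                   # (M, C)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * feats).sum(axis=0)                 # (C,)

# Toy setup: 3 views with random 8x8 feature maps of 4 channels each.
rng = np.random.default_rng(0)
M, H, W_img, C = 3, 8, 8, 4
feature_maps = [rng.normal(size=(H, W_img, C)) for _ in range(M)]
K = np.array([[8.0, 0.0, 4.0], [0.0, 8.0, 4.0], [0.0, 0.0, 1.0]])
Ks = [K] * M
Rs = [np.eye(3)] * M
ts = [np.array([0.1 * i, 0.0, 2.0]) for i in range(M)]
p = np.zeros(3)

feats = gather_features(p, feature_maps, Ks, Rs, ts)
W_att = rng.normal(size=(C, C))
point_feature = attention_aggregate(feats, W_att)
```

Because the pooling weights are computed per view and normalized across views, a view in which the point is occluded can in principle receive a low weight, which is one way to read the abstract's claim that occlusions are taken into account implicitly.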

Paper Structure

This paper contains 25 sections, 3 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: A single model of our GRF infers high-quality novel views for new objects of seen and unseen categories, demonstrating its strong capability for 3D representation and rendering.
  • Figure 2: Our GRF projects each 3D point, $p$, to each of the $M$ input images, gathering per-pixel features from each view. These features are aggregated and fed to an MLP that infers the color and volumetric density of $p$.
  • Figure 3: Our CNN module extracts robust per-pixel features from each input view using skip connections.
  • Figure 4: Reprojecting pixel features back to a 3D point $p$.
  • Figure 6: The aggregated point features and viewing direction are concatenated as input to an MLP to predict color and density for every point.
  • ...and 9 more figures
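
Once an MLP has predicted a color and a volumetric density for every sampled point, a NeRF-like renderer turns the samples along a camera ray into a pixel color. The sketch below uses the standard NeRF volume-rendering quadrature; that GRF follows exactly this formulation is an assumption based on the "NeRF-like renderer" description, and the function name and the large sentinel for the last interval are illustrative choices.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and colors along one ray.

    sigmas: (N,) non-negative densities, colors: (N, 3) RGB values,
    t_vals: (N,) increasing sample distances along the ray.
    Returns the rendered RGB value and the per-sample weights.
    """
    # Distance between adjacent samples; the last interval is treated as
    # effectively unbounded (sentinel value, an assumption of this sketch).
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    # Opacity contributed by each interval.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Toy ray: a dense red sample in front of a dense green one.
sigmas = np.array([1e3, 1e3])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
t_vals = np.array([1.0, 2.0])
rgb, weights = composite_ray(sigmas, colors, t_vals)
```

In the toy ray the first sample is effectively opaque, so the green sample behind it contributes nothing and the pixel comes out red.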