Is Attention All That NeRF Needs?

Mukund Varma T, Peihao Wang, Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang

TL;DR

The paper presents Generalizable NeRF Transformer (GNT), a two-stage transformer framework that reconstructs neural radiance fields and renders novel views without scene-specific optimization. A view transformer, constrained by epipolar geometry, aggregates multi-view features into coordinate-aligned representations, while a ray transformer renders new viewpoints via attention-driven, learned rendering along sampled rays. GNT achieves state-of-the-art performance in both the single-scene and cross-scene settings, including challenging cases with refraction and reflection, and offers interpretable attention maps that align with depth and occlusion cues. These results suggest transformers can serve as a universal modeling tool for graphics, effectively replacing handcrafted rendering equations with learned, geometry-aware attention mechanisms.

Abstract

We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalizes across scenes using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views using attention to decode the features from the view transformer along the sampled points during ray marching. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula due to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes and outperforms all other methods by ~10% on average. Our analysis of the learned attention maps to infer depth and occlusion indicates that attention enables learning a physically-grounded rendering. Our results show the promise of transformers as a universal modeling tool for graphics. Please refer to our project page for video results: https://vita-group.github.io/GNT/.
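
The two stages described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' architecture: there are no learned projections, positional encodings, or color head, and the function names (`attend`, `render_ray`) and the use of a mean feature as the attention query are assumptions made purely for illustration. It only shows the data flow: attention first fuses the per-view features of each sampled point, then attention across the points along the ray produces one ray feature.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights

def render_ray(point_view_feats):
    """point_view_feats[p][v] is the feature of sample point p in source view v.

    Returns a fused ray feature and the attention weight of each point.
    """
    # Stage 1 (stand-in for the view transformer): fuse views per point.
    # Using the mean feature as the query is an assumption of this sketch.
    point_feats = []
    for view_feats in point_view_feats:
        fused, _ = attend(mean_vector(view_feats), view_feats, view_feats)
        point_feats.append(fused)
    # Stage 2 (stand-in for the ray transformer): attend across ray samples.
    ray_feat, point_weights = attend(
        mean_vector(point_feats), point_feats, point_feats
    )
    return ray_feat, point_weights
```

In this toy version the per-point weights returned by the second stage play the role that the abstract attributes to the learned attention maps: they form a distribution over samples along the ray, which is the kind of quantity one could inspect for depth or occlusion cues.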

Paper Structure (42 sections, 9 equations, 11 figures, 7 tables, 3 algorithms)

Figures (11)

  • Figure 1: Overview of Generalizable NeRF Transformer (GNT): 1) Identify source views for a given target view, 2) Extract features for epipolar points using a trainable U-Net-like model, 3) For each ray in the target view, sample points and directly predict the target pixel's color by aggregating features across views (View Transformer) and across points along the ray (Ray Transformer).
  • Figure 2: Detailed network architectures of the view transformer and ray transformer in GNT, where $X$ represents the epipolar features, $X_{0}$ represents the aggregated ray features, and $\{x, d, \Delta d\}$ denote the point coordinates, the viewing direction, and the relative directions of the source views with respect to the target view.
  • Figure 3: Qualitative results for single-scene rendering. In the Orchids scene from LLFF (first row), GNT recovers the shape of the leaves more accurately. In the Drums scene from Blender (second row), GNT's learnable renderer is able to model physical phenomena like specular reflections.
  • Figure 4: Qualitative results of GNT for generalizable rendering on the complex Shiny dataset, which contains more refraction and reflection. A pre-trained GNT naturally adapts to complex refraction through the test tube and renders the interference patterns on the disk with higher quality.
  • Figure 5: Qualitative results for cross-scene rendering. On the unseen Flowers (first row) and Fern (second row) scenes, GNT recovers the edges of petals and pillars more accurately than IBRNet and is visually comparable to NeuRay.
  • ...and 6 more figures
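
The "epipolar points" mentioned in the Figure 1 caption can be made concrete with a small sketch. It assumes an ideal pinhole source camera with a world-to-camera rotation `R`, translation `t`, focal length `f`, and principal point `(cx, cy)`. All of these names, and the helper functions, are hypothetical and not taken from the paper. Points sampled along one target ray are projected into the source view, where they fall on a single image line (the epipolar line). Feature lookup at those pixels is omitted.

```python
def sample_ray(origin, direction, near, far, n):
    """Return n 3D points evenly spaced in depth along a ray."""
    depths = [near + (far - near) * i / (n - 1) for i in range(n)]
    return [[o + s * d for o, d in zip(origin, direction)] for s in depths]

def project(point, R, t, f, cx, cy):
    """Project a world-space point into a pinhole camera; returns (u, v)."""
    # World -> camera coordinates: cam = R @ point + t
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective divide, then shift by the principal point.
    return (f * cam[0] / cam[2] + cx, f * cam[1] / cam[2] + cy)

def is_collinear(pixels, tol=1e-6):
    """Check that all 2D points lie on the line through the first two."""
    (u0, v0), (u1, v1) = pixels[0], pixels[1]
    for (u, v) in pixels[2:]:
        cross = (u1 - u0) * (v - v0) - (v1 - v0) * (u - u0)
        if abs(cross) > tol:
            return False
    return True

# Hypothetical setup: target ray from the world origin, and a source camera
# that is translated (but not rotated) relative to the world frame.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = sample_ray([0.0, 0.0, 0.0], [0.1, 0.2, 1.0], near=2.0, far=6.0, n=5)
pixels = [project(p, identity, [-0.5, 0.2, 0.1], 500.0, 320.0, 240.0)
          for p in points]
print(is_collinear(pixels))
```

Restricting attention to pixels on this line in each source view is one way to read the "multi-view geometry as an inductive bias" phrasing in the abstract.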