GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis

You Wang, Li Fang, Hao Zhu, Fei Hu, Long Ye, Zhan Ma

TL;DR

GoLF-NRT tackles few-shot view synthesis by fusing global scene context with local geometric cues through a near-linear 3D transformer and an adaptive, kernel-regressed sampling strategy. The method first builds a coarse global representation $\boldsymbol{Z}_g$ to produce a ray-specific $\boldsymbol{F}_g$, then guides local epipolar-line feature aggregation with this global cue to form $\boldsymbol{F}_{g-l}$, which is finally decoded to color via an MLP. Across LLFF, Blender, and Shiny datasets, GoLF-NRT achieves state-of-the-art performance for 1–3 input views and remains competitive in 10-view settings, with notable robustness to reflective and occluded regions. The combination of global-context guidance and adaptive local sampling reduces depth ambiguities and artifacts, enabling high-fidelity, view-consistent renderings suitable for real-world deployment.
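The global-to-local guidance step can be pictured as a single attention query: the ray's global feature scores each local feature sampled on the epipolar line, and the attention-pooled result is concatenated with the global feature. The sketch below is only an illustration under simplifying assumptions. The function name `fuse_global_local`, the toy feature values, and the omission of learned projections and of the final MLP are all choices made here, not details taken from the paper or its code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_global_local(f_g, local_feats):
    """Use the global ray feature as a query over local epipolar samples.

    f_g:         list[float], hypothetical global context feature for one ray
    local_feats: list[list[float]], one feature per sample on the epipolar line
    Returns (weights, fused), where fused = concat(f_g, attention-pooled local).
    """
    d = len(f_g)
    # Scaled dot-product similarity between the query and every local sample.
    scores = [sum(q * k for q, k in zip(f_g, feat)) / math.sqrt(d)
              for feat in local_feats]
    weights = softmax(scores)
    # Attention-weighted average of the local features.
    pooled = [sum(w * feat[i] for w, feat in zip(weights, local_feats))
              for i in range(d)]
    # List concatenation stands in for the fused global-local feature.
    return weights, f_g + pooled

# Toy inputs (made up): three local samples, the first most similar to f_g.
f_g = [1.0, 0.0]
local_feats = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
weights, fused = fuse_global_local(f_g, local_feats)
```

In this toy case the first local sample receives the largest weight because it aligns with the global query, which is the behavior the global cue is meant to encourage.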

Abstract

Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. Code is available at https://github.com/KLMAV-CUC/GoLF-NRT.
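To make "features extracted along the epipolar line" concrete, the following sketch projects points sampled along a target ray into a source view using a standard pinhole camera model. The projected pixels lie on the epipolar line, which is where local features would be looked up. The camera parameters, depths, and function names are hypothetical values chosen for illustration, and this is not the authors' implementation.

```python
def project(point, K, R, t):
    """Project a 3D world point into a pinhole camera; returns (u, v) pixels."""
    # World -> camera coordinates: X_c = R @ X + t
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Camera -> homogeneous pixel coordinates: x = K @ X_c
    hom = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return hom[0] / hom[2], hom[1] / hom[2]

def epipolar_samples(origin, direction, depths, K, R, t):
    """Pixel locations in a source view of points sampled along a target ray."""
    pts = [[origin[i] + s * direction[i] for i in range(3)] for s in depths]
    return [project(p, K, R, t) for p in pts]

# Toy setup (all values hypothetical): the target ray starts at the world
# origin and points along +z; the source camera centre is at x = +1.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-1.0, 0.0, 0.0]  # t = -R @ C for camera centre C = (1, 0, 0)
pixels = epipolar_samples([0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
                          [1.0, 2.0, 4.0], K, R, t)
# pixels == [(-50.0, 50.0), (0.0, 50.0), (25.0, 50.0)]
```

All three projections share the same row (v = 50), so they trace a horizontal epipolar line in the source image. Samples that fall outside the image bounds, such as the first one here, would need to be masked in practice.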

Paper Structure

This paper contains 29 sections, 7 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: The integration of global context and local geometry in GoLF-NRT significantly enhances reconstruction accuracy. (a) Local geometric features (red) are extracted along the epipolar line of the target pixel, effectively capturing local geometric constraints, while surrounding features (yellow) provide the global context. (b) Relying solely on local features can introduce depth ambiguities and artifacts, while using only global features often fails to capture fine details. By combining both, GoLF-NRT achieves more precise and detailed reconstructions.
  • Figure 2: Overview of GoLF-NRT: 1) A feature pyramid network (FPN) extracts multi-scale features from input views. 2) Coarse features are encoded to create scene representations, from which global context features are decoded for each target ray. 3) These global context features query local geometric features along the epipolar lines of all source views. The global and local features are concatenated and processed to predict the final target ray representation. 4) Pixel colors are directly predicted using an MLP.
  • Figure 3: Network architecture of the Global Context Feature Extraction Module. The 3D transformer, with near-linear complexity, encodes the scene into a scene representation, while the decoder generates the global context feature for each target ray.
  • Figure 4: Illustration of how adaptive sampling with kernel regression enhances local geometric perception. (a) Uniform sampling, lacking surface depth constraints, distributes points broadly and randomly. (b) Adaptive sampling based solely on attention weights can lead to disorderly sampling due to the fundamental differences between feature similarity and the PDF used in volume rendering. (c) Our adaptive sampling with kernel regression bridges this gap, focusing samples within the most likely surface depth range, thereby improving subsequent image synthesis.
  • Figure 5: Qualitative comparison of GoLF-NRT with GNT, EVE-NeRF, and CaesarNeRF using 3 and 2 input views: (a) Trex scene (LLFF), Ficus scene (Blender), and Lab scene (Shiny); (b) Horns scene (LLFF), Hotdog scene (Blender), and Crest scene (Shiny). Each triplet shows the reconstructed image (left), a zoomed-in view (upper right), and its corresponding error map (lower right).
  • ...and 7 more figures
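
The idea in the Figure 4 caption, turning noisy attention weights into a depth distribution that concentrates samples near the likely surface, can be sketched as follows. This is one plausible reading and not the paper's formulation: it assumes a Nadaraya-Watson regressor with a Gaussian kernel and inverse-transform sampling from a piecewise-constant PDF. The bandwidth, the attention values, and the function names are made up for illustration.

```python
import math
import random
from bisect import bisect_right

def kernel_smooth(depths, weights, bandwidth):
    """Nadaraya-Watson regression of attention weights over depth (Gaussian kernel)."""
    smoothed = []
    for z in depths:
        k = [math.exp(-0.5 * ((z - zi) / bandwidth) ** 2) for zi in depths]
        smoothed.append(sum(ki * wi for ki, wi in zip(k, weights)) / sum(k))
    return smoothed

def sample_depths(depths, pdf_vals, n, rng):
    """Inverse-transform sampling from a piecewise-constant PDF over depth bins.

    Bin i spans [depths[i], depths[i + 1]] and takes its mass from pdf_vals[i].
    """
    mass = [pdf_vals[i] * (depths[i + 1] - depths[i])
            for i in range(len(depths) - 1)]
    total = sum(mass)
    cdf, acc = [], 0.0
    for m in mass:
        acc += m / total
        cdf.append(acc)
    out = []
    for _ in range(n):
        u = rng.random()
        # Find the bin whose CDF interval contains u (clamped for rounding).
        i = min(bisect_right(cdf, u), len(mass) - 1)
        lo = cdf[i - 1] if i > 0 else 0.0
        frac = (u - lo) / (cdf[i] - lo) if cdf[i] > lo else 0.5
        frac = min(frac, 1.0)
        out.append(depths[i] + frac * (depths[i + 1] - depths[i]))
    return sorted(out)

depths = [2.0 + 0.5 * i for i in range(9)]  # 2.0 ... 6.0
attn = [0.02, 0.30, 0.03, 0.05, 0.40, 0.10, 0.04, 0.03, 0.03]  # noisy, made-up
pdf_vals = kernel_smooth(depths, attn, bandwidth=0.5)
new_depths = sample_depths(depths, pdf_vals, n=16, rng=random.Random(0))
```

Smoothing pulls isolated attention spikes toward their neighbors, so the resulting PDF varies more gradually with depth than the raw weights do. The bandwidth controls how tightly the new samples cluster around high-attention depths.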