CaesarNeRF: Calibrated Semantic Representation for Few-shot Generalizable Neural Rendering

Haidong Zhu, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Ram Nevatia, Luming Liang

TL;DR

CaesarNeRF proposes a calibrated semantic representation that fuses scene-level semantics with pixel-level features to improve few-shot generalizable neural rendering. It calibrates scene semantics across views using camera pose transformations and introduces sequential refinement to capture details at multiple levels, integrated into a GNT-based transformer framework. The method achieves state-of-the-art results across LLFF, Shiny, mip-NeRF 360, and MVImgNet, even with a single reference image, and adapts to other NeRF pipelines, highlighting strong generalization and transferability. While effective, it relies on input observations and may occasionally generate content beyond available data, signaling a need for careful consideration of potential negative impacts.

Abstract

Generalizability and few-shot learning are key challenges in Neural Radiance Fields (NeRF), often due to the lack of a holistic understanding in pixel-level rendering. We introduce CaesarNeRF, an end-to-end approach that leverages scene-level CAlibratEd SemAntic Representation along with pixel-level representations to advance few-shot, generalizable neural rendering, facilitating a holistic understanding without compromising high-quality details. CaesarNeRF explicitly models pose differences of reference views to combine scene-level semantic representations, providing a calibrated holistic understanding. This calibration process aligns various viewpoints with precise location and is further enhanced by sequential refinement to capture varying details. Extensive experiments on public datasets, including LLFF, Shiny, mip-NeRF 360, and MVImgNet, show that CaesarNeRF delivers state-of-the-art performance across varying numbers of reference views, proving effective even with a single reference image.

Paper Structure (12 sections, 10 equations, 8 figures, 7 tables)

This paper contains 12 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Novel view synthesis for novel scenes using ONE reference view on Shiny [wizadwongsa2021nex], LLFF [mildenhall2019local], and MVImgNet [yu2023mvimgnet] (top to bottom). Each triplet of images shows the results from GNT [varma2022attention] (left), CaesarNeRF (middle), and the ground truth (right).
  • Figure 2: Overview of CaesarNeRF. CaesarNeRF employs a shared encoder to capture two types of features from input views, including scene-level semantic representation $\{\bm S_n\}$ and pixel-level feature representation $\{\bm F_n\}$. We use the same encoder for both the scene-level semantic representation and the pixel-level embeddings. Following calibration and aggregation of $\{\bm S_n\}$ from various views, we concatenate it with the pixel-level fused feature, processed by the view transformer. Subsequent use of the ray-transformer, coupled with sequential refinement, enables us to render the final RGB values for each pixel in the target view. The output features serve as the input for the next stage, indicated by matching line colors.
  • Figure 3: An illustration of conflicting semantic meanings from multiple viewpoints of the same object. When observing the cup from distinct angles, the features extracted after pooling retain spatial information but are inconsistent in the scene-level semantic understanding, leading to conflicts across various reference images after aggregation.
  • Figure 4: Visualization of decoded feature maps for "orchid" in the LLFF dataset, produced by ray transformers [varma2022attention] at different stages. From left to right, the transformer stages increase in depth.
  • Figure 5: Comparative visualization of our proposed method against other state-of-the-art methods.
  • ...and 3 more figures
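
The calibration and aggregation of the scene-level representations $\{\bm S_n\}$ described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes, purely for illustration, that each $C$-dimensional scene vector is viewed as $C/3$ three-dimensional points, that a known rotation maps each reference view to the target view, and that aggregation is a plain average. The function names `calibrate_and_aggregate` and `rotation_z` are hypothetical.

```python
import numpy as np


def calibrate_and_aggregate(scene_feats, rotations):
    """Rotate each view's scene-level vector into the target frame, then average.

    scene_feats: (N, C) array, one global feature per reference view (C divisible by 3).
    rotations:   (N, 3, 3) array, assumed rotation from reference view n to the target view.
    """
    scene_feats = np.asarray(scene_feats, dtype=np.float64)
    rotations = np.asarray(rotations, dtype=np.float64)
    n_views, channels = scene_feats.shape
    if channels % 3 != 0:
        raise ValueError("channel count must be divisible by 3")
    # Assumption of this sketch: read each C-dim vector as C/3 points in 3D, shape (N, 3, C/3).
    as_points = scene_feats.reshape(n_views, 3, channels // 3)
    # Apply each view's rotation to its own set of 3D points (batched matmul).
    rotated = rotations @ as_points
    # Flatten back to (N, C) and average over the reference views.
    return rotated.reshape(n_views, channels).mean(axis=0)


def rotation_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy check: two views observe the same underlying feature in different frames.
rng = np.random.default_rng(0)
canonical = rng.normal(size=6)                     # feature expressed in the target frame
R = rotation_z(np.pi / 2)                          # view 2 -> target rotation
view1 = canonical                                  # view 1 already matches the target frame
view2 = (R.T @ canonical.reshape(3, 2)).reshape(6) # same feature, seen from view 2

naive = np.stack([view1, view2]).mean(axis=0)      # averaging without calibration
calibrated = calibrate_and_aggregate(
    np.stack([view1, view2]), np.stack([np.eye(3), R])
)
```

In this toy setup the uncalibrated average mixes features expressed in two different frames, which mirrors the conflict illustrated in Figure 3, while the calibrated average recovers the target-frame feature. In the full pipeline the aggregated vector would then be concatenated with the pixel-level fused feature, a step this sketch leaves out.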