
Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Kai-En Lin, Lin Yen-Chen, Wei-Sheng Lai, Tsung-Yi Lin, Yi-Chang Shih, Ravi Ramamoorthi

TL;DR

This work tackles single-image, unposed novel view synthesis by introducing a hybrid 3D representation that fuses global information from a Vision Transformer with local appearance features from a 2D CNN, which is then used to condition a NeRF MLP for volume rendering. The viewer-centered approach avoids canonical camera poses and demonstrates strong cross-category generalization, achieving state-of-the-art or competitive results on category-specific and category-agnostic tasks, as well as qualitative success on real images. Ablation studies confirm the complementary roles of global ViT features and local CNN features, and show that including viewing-direction information further enhances detail. The method advances immersive 3D content synthesis by effectively reconstructing occluded regions with richer geometry and texture, without relying on poses.

Abstract

Although neural radiance fields (NeRF) have shown impressive advances for novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view. To address this issue, we propose to leverage both the global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints like symmetry or canonical coordinate systems. Our method can render novel views from only a single input image and generalize across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.
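The volume-rendering step mentioned in the abstract can be illustrated with the standard NeRF quadrature, which alpha-composites the densities and colors predicted along a camera ray. The following is a minimal NumPy sketch of that general technique, not the authors' implementation; the function name `composite_ray` and its arguments are hypothetical, and in the actual method the densities and colors would come from the MLP conditioned on the global and local features.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite samples along a single ray (standard NeRF quadrature).

    sigmas: (S,) densities at the S samples (hypothetical MLP output)
    colors: (S, 3) RGB values at the samples (hypothetical MLP output)
    deltas: (S,) distances between consecutive samples
    """
    sigmas = np.asarray(sigmas, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    deltas = np.asarray(deltas, dtype=np.float64)

    # Opacity contributed by each sample segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light that reaches sample i unabsorbed.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    # Expected color of the ray is the weighted sum of sample colors.
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

The weights sum to at most one; whatever remains corresponds to light that passes through all samples, which is why an empty ray renders as black in this sketch.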

Paper Structure (28 sections, 10 equations, 21 figures, 7 tables)

Figures (21)

  • Figure 1: Novel view synthesis in occluded regions. The visual quality of image-conditioned models (e.g., PixelNeRF [yu2020pixelnerf]) degrades significantly when pixels in the target view are invisible from the input. We propose to incorporate both global features from a vision transformer (ViT) and local appearance features from convolutional networks to achieve significantly better rendering quality with more details in the occluded regions. Note that LPIPS [zhang2018perceptual] (lower is better) reflects perceptual similarity better than PSNR.
  • Figure 2: The challenge of image-conditioned models in the presence of self-occlusion. To render a car's occluded wheel (blue dot) in the target view, image-conditioned models, e.g., PixelNeRF [yu2020pixelnerf], query features along the ray, which corresponds to the car's window in the input view (red cross). Our method uses self-attention to learn long-range dependencies, so it can find the most related features in the source view (green dot) for rendering a clear target view.
  • Figure 3: Illustration of different representations for a 3D object. (a) 1D latent code-based approaches [chen2018implicit_decoder, dupont2020equivariant, jang2021codenerf, Occupancy_Networks, DVR, Park_2019_CVPR] encode the 3D object in a 1D vector. (b) 2D image-based methods [pifuSHNMKL19, yu2020pixelnerf] are conditioned on the per-pixel image features to reconstruct any 3D point. (c) 3D voxel-based approaches [fast-and-explicit-neural-view-synthesis, lombardi2019neural] treat a 3D object as a collection of voxels and apply 3D convolutions to generate the color and density vector RGB$\sigma$.
  • Figure 4: Overview of our rendering pipeline. We first divide an input image into $N = 8 \times 8$ patches $\textbf{P}$. Each patch is flattened and linearly projected to an image token $\textbf{P}_l$. The transformer encoder takes the image tokens and learnable positional embeddings $\textbf{e}$ as input to extract global information as a set of latent features $f$ (Sec. \ref{sec:vit}). Then, we decode the latent features into multi-level feature maps $\textbf{W}_G$ using a convolutional decoder. In addition to global features, we use another 2D CNN $\mathcal{G}_L$ to obtain local image features (Sec. \ref{sec:cnn}). Finally, we sample the features for volume rendering using the NeRF MLP (Sec. \ref{sec:volume_rendering}).
  • Figure 5: Category-specific view synthesis on Chairs. The results of SRN and PixelNeRF are often too blurry, especially on the legs that are not visible in the input views. Our method can generate novel views with clearer structures and sharper edges.
  • ...and 16 more figures
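
The tokenization step described in the Figure 4 caption (split the image into an 8 × 8 grid of patches, flatten each patch, project it linearly, and add a positional embedding) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: the function names, the 64 × 64 input size, the embedding width of 32, and the random projection and positional embeddings (which would be learned in practice) are all hypothetical.

```python
import numpy as np

def patchify(image, grid=8):
    """Split an (H, W, C) image into grid*grid flattened patches."""
    h, w, c = image.shape
    assert h % grid == 0 and w % grid == 0, "image must divide evenly"
    ph, pw = h // grid, w // grid
    # (grid, ph, grid, pw, C) -> (grid, grid, ph, pw, C)
    patches = image.reshape(grid, ph, grid, pw, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major grid order.
    return patches.reshape(grid * grid, ph * pw * c)

def embed_patches(patches, proj, pos_emb):
    """Linearly project flattened patches and add positional embeddings."""
    return patches @ proj + pos_emb

# Usage with hypothetical sizes and random stand-ins for learned weights.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patches = patchify(image)                  # shape (64, 192)
proj = rng.normal(size=(192, 32))          # stand-in for the learned projection
pos_emb = rng.normal(size=(64, 32))        # stand-in for learned embeddings e
tokens = embed_patches(patches, proj, pos_emb)  # shape (64, 32)
```

The resulting token matrix is what a transformer encoder would consume; the encoder itself, the convolutional decoder, and the local CNN are omitted here.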