MSI-NeRF: Linking Omni-Depth with View Synthesis through Multi-Sphere Image aided Generalizable Neural Radiance Field

Dongyu Yan, Guanyu Huang, Fengyu Quan, Haoyao Chen

TL;DR

This work introduces MSI-NeRF, which combines deep-learning-based omnidirectional depth estimation with novel view synthesis. It builds an implicit radiance field that takes spatial points and interpolated 3D feature vectors as input, so a single model performs both omnidirectional depth estimation and 6DoF view synthesis.

Abstract

Panoramic observation using fisheye cameras is significant in virtual reality (VR) and robot perception. However, panoramic images synthesized by traditional methods lack depth information and can only provide three degrees-of-freedom (3DoF) rotation rendering in VR applications. To fully preserve and exploit the parallax information within the original fisheye cameras, we introduce MSI-NeRF, which combines deep learning omnidirectional depth estimation and novel view synthesis. We construct a multi-sphere image as a cost volume through feature extraction and warping of the input images. We further build an implicit radiance field using spatial points and interpolated 3D feature vectors as input, which can simultaneously realize omnidirectional depth estimation and 6DoF view synthesis. Leveraging the knowledge from the depth estimation task, our method can learn scene appearance from source view supervision only. It does not require novel target views and can be trained conveniently on existing panorama depth estimation datasets. Our network generalizes to unseen scenes and reconstructs them efficiently from only four images. Experimental results show that our method outperforms existing methods in both depth estimation and novel view synthesis tasks.
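The multi-sphere construction described in the abstract can be pictured with a small geometric sketch. The code below is not the authors' implementation: it assumes an equirectangular parametrization of each sphere, radii spaced uniformly in inverse depth, and an equidistant fisheye model (image radius = focal × incidence angle), none of which is stated in this summary. The function names `inverse_depth_radii`, `msi_points`, and `project_equidistant` are hypothetical. It only shows how points on concentric spheres could be mapped to fisheye pixel coordinates, which is where image features would be sampled when warping.

```python
import numpy as np

def inverse_depth_radii(near, far, num):
    # Assumption: sphere radii are spaced uniformly in inverse depth.
    return 1.0 / np.linspace(1.0 / near, 1.0 / far, num)

def msi_points(height, width, radii):
    # Equirectangular grid of viewing directions (assumed parametrization).
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)  # each of shape (H, W)
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )  # unit vectors, shape (H, W, 3)
    # Scale the unit directions by each radius: shape (D, H, W, 3).
    return radii[:, None, None, None] * dirs[None]

def project_equidistant(points_cam, focal, cx, cy):
    # Assumption: equidistant fisheye model, points already in the camera
    # frame with +z along the optical axis.
    x, y, z = points_cam[..., 0], points_cam[..., 1], points_cam[..., 2]
    theta = np.arctan2(np.hypot(x, y), z)  # angle from the optical axis
    phi = np.arctan2(y, x)                 # azimuth in the image plane
    r = focal * theta
    return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)
```

In a full pipeline, the sphere points would first be transformed from the rig frame into each camera's frame using its extrinsics, and the projected pixel coordinates would then index a 2D feature map by bilinear interpolation.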

Paper Structure (17 sections, 11 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Our method takes as input the images captured by four fisheye cameras arranged in a panoramic configuration. Aided by a multi-sphere image, it produces a generalizable omnidirectional radiance field. From the radiance field, we can query the occupancy and color of any spatial position and ray direction. Novel view synthesis and depth estimation are then accomplished with the volume rendering formula.
  • Figure 2: Structure of our method. Our method can be divided into three parts. First, by 2D feature extraction and warping, a multi-sphere image (MSI) representation is built. Then, through geometric and appearance 3D decoders, explicit features ($\mathbf{f}_{geo}$, $\mathbf{f}_{appr}$) containing prior information are obtained. They are then fed into the NeRF implicit MLP along with point position $\mathbf{x}$, ray direction $\mathbf{d}$, and projected color $\mathbf{c}$. The output occupancy and color are used to render fisheye color and depth images, which form the final supervision loss.
  • Figure 3: Qualitative results of our depth estimation experiment. We compare the generated omnidirectional depth map with the ground truth depth from the dataset. Our method generates fine-grained depth estimates while avoiding over-fitting to rich texture.
  • Figure 4: Visualization of our Replica360 dataset. The fisheye source views (a) and the multi-view omnidirectional target images (b) are captured from the Replica simulator.
  • Figure 5: Qualitative results of our novel view synthesis experiment. We generate novel panoramic images at the dataset's target view locations and compare them with the ground truth. Our method generates high-quality and consistent rendering results, avoiding blurring and ghosting artifacts.
  • ...and 4 more figures
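
The captions for Figures 1 and 2 describe rendering both color and depth from the radiance field by volume rendering. As a minimal sketch, the code below uses the standard density-based NeRF quadrature for a single ray. This is an assumption: the captions refer to occupancy, and the paper's exact formulation may differ. The function name `render_ray` and its argument layout are hypothetical.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Composite samples along one ray (standard NeRF quadrature, assumed).

    sigma: (N,) non-negative densities at the samples
    color: (N, 3) RGB values at the samples
    t:     (N,) increasing sample distances along the ray
    """
    # Distance between consecutive samples; the last interval is treated
    # as effectively infinite.
    delta = np.diff(t, append=t[-1] + 1e10)
    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    rgb = (weights[:, None] * color).sum(axis=0)  # expected color
    depth = (weights * t).sum()                   # expected ray distance
    return rgb, depth, weights
```

The same weights produce both outputs, which is why a single field can serve view synthesis and depth estimation at once: the color image is the weighted sum of sample colors, and the depth image is the weighted sum of sample distances.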