
CUBE360: Learning Cubic Field Representation for Monocular 360 Depth Estimation for Virtual Reality

Wenjie Chang, Hao Ai, Tianzhu Zhang, Lin Wang

TL;DR

This work proposes a novel method, named CUBE360, that learns a cubic field composed of multiple MPIs from a single panoramic image for depth estimation at any view direction and highlights its effectiveness in downstream applications, such as VR roaming and visual effects, underscoring CUBE360's potential to enhance immersive experiences.

Abstract

Panoramic images provide comprehensive scene information and are suitable for VR applications. Obtaining corresponding depth maps is essential for achieving immersive and interactive experiences. However, panoramic depth estimation presents significant challenges due to the severe distortion caused by equirectangular projection (ERP) and the limited availability of panoramic RGB-D datasets. Inspired by the recent success of neural rendering, we propose a novel method, named $\mathbf{CUBE360}$, that learns a cubic field composed of multiple multi-plane images (MPIs) from a single panoramic image for $\mathbf{continuous}$ depth estimation at any view direction. CUBE360 employs cubemap projection to transform an ERP image into six faces and extracts the MPIs for each face, thereby reducing the memory consumption required for MPI processing of high-resolution data. Additionally, this approach avoids the computational complexity of handling the uneven pixel distribution inherent to equirectangular projection. An attention-based blending module is then employed to learn correlations among the MPIs of the cubic faces, constructing a cubic field representation with color and density information at various depth levels. Furthermore, a novel sampling strategy is introduced for rendering novel views from the cubic field at both cubic and planar scales. The entire pipeline is trained with a photometric loss computed on the rendered views in a self-supervised learning (SSL) manner, enabling training on 360 videos without depth annotations. Experiments on both synthetic and real-world datasets demonstrate the superior performance of CUBE360 compared to prior SSL methods. We also highlight its effectiveness in downstream applications, such as VR roaming and visual effects, underscoring CUBE360's potential to enhance immersive experiences.
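The cubemap step mentioned in the abstract can be illustrated with a minimal sketch that maps a pixel on one cubic face to its source location in the ERP image. Only the face labels (B, D, F, L, R, U) come from the paper; the axis convention (x right, y up, z forward), the face orientations, and the function name `cube_pixel_to_erp` are assumptions made for illustration and may differ from the authors' implementation.

```python
import math

# Assumed convention: x points right, y up, z forward (toward the front face).
# Each entry turns face-plane coordinates (a, b) in [-1, 1] into a 3D ray direction.
FACE_DIRECTIONS = {
    "F": lambda a, b: (a, b, 1.0),     # front
    "R": lambda a, b: (1.0, b, -a),    # right
    "B": lambda a, b: (-a, b, -1.0),   # back
    "L": lambda a, b: (-1.0, b, a),    # left
    "U": lambda a, b: (a, 1.0, -b),    # up
    "D": lambda a, b: (a, -1.0, b),    # down
}


def cube_pixel_to_erp(face, col, row, face_size, erp_width, erp_height):
    """Return the (u, v) ERP pixel coordinates seen by a cube-face pixel."""
    # Pixel centre -> face-plane coordinates; b increases upward.
    a = 2.0 * (col + 0.5) / face_size - 1.0
    b = 1.0 - 2.0 * (row + 0.5) / face_size
    x, y, z = FACE_DIRECTIONS[face](a, b)
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)        # longitude in [-pi, pi], 0 at the front
    lat = math.asin(y / norm)     # latitude in [-pi/2, pi/2]
    u = (lon / (2.0 * math.pi) + 0.5) * erp_width
    v = (0.5 - lat / math.pi) * erp_height
    return u, v
```

Sampling the ERP image at the returned `(u, v)` for every pixel of a face (for example with bilinear interpolation) yields one perspective view with a 90-degree field of view. This is why each face has a roughly uniform pixel distribution, in contrast to the stretched polar regions of the ERP image.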

Paper Structure

This paper contains 16 sections, 23 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Results from separate MPIs and from our cubic representation. The proposed cubic field produces more consistent panoramic depth estimates than the separate MPIs.
  • Figure 2: An overview of the proposed pipeline. An input panorama is split into six cubic faces, each capturing a different view of the scene. A convolution-based network takes the cubic faces as inputs and generates the MPIs $\left\{ \mathbf{MPI}^i_0 | i \in \left\{ B, D, F, L, R, U\right\} \right\}$ for each view. The MPIs capture the scene’s appearance and geometry by representing the RGB values $c$ and density values $\sigma$ of imaging planes at a set of depth levels $\mathbf{D}$. Then, a series of blending operations is applied to update and integrate the separate MPIs from different faces. The integrated features are then used to extract a cubic field, manifested as the fused MPIs $\left\{ \mathbf{MPI}^i_t | i \in \left\{ B, D, F, L, R, U\right\} \right\}$ at depth set $\mathbf{D}$. Novel views are rendered from the cubic field at two scales and used to construct the photometric loss for supervision.
  • Figure 3: Illustration of the inter-face blending and cube-ERP blending processes in CUBE360. For inter-face blending, the Multi-Plane Images (MPIs) from the six cubic faces are tokenized and fed into a self-attention module to enhance the holistic representation of the cubic field. The positional encoding is applied based on spherical coordinates, and the resulting attention matrix helps capture interactions between tokens. In the cube-ERP blending stage, global ERP features, extracted through convolution and pooling, are integrated with the cubic field tokens using cross-attention. The final output restores six feature maps, representing the enhanced geometry and color information for each cubic face.
  • Figure 4: Illustration of the padding blending. (a) represents the proposed padding blending method, where $\tilde{\mathbf{MPI}}^i$ is concatenated with the corresponding positional information, followed by convolution and ReLU activation to generate the cubic field. In (b), the left image illustrates the adjacency relationship between the target cubic face and the other five faces. As depicted in the right image, this adjacency relationship enables us to integrate the information at the edge of the adjacent face into the target cubic face, thereby achieving feature fusion at the cubic level.
  • Figure 5: Network details of the adopted encoder-decoder architecture. $MPI_{w/4}$ and $MPI_{w/8}$ are Multi-Plane Images predicted at resolutions $[w/4,w/4]$ and $[w/8,w/8]$, respectively. $MPI_0^i$ denotes the MPIs predicted at resolution $[w/2,w/2]$, which are further fed into the proposed blending modules.
  • ...and 5 more figures
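
The Figure 2 caption describes each MPI as RGB values $c$ and density values $\sigma$ on planes at a set of depth levels $\mathbf{D}$. The sketch below shows how such samples along a single ray could be composited into a color and a depth value. It uses the standard front-to-back volume-rendering weights found in neural rendering; the paper's own rendering equations are not part of this excerpt, so this formulation, the function name `composite_ray`, and the `last_delta` parameter are assumptions.

```python
import math


def composite_ray(colors, sigmas, depths, last_delta=1e3):
    """Composite per-plane (color, density) samples along one ray, front to back.

    colors: list of (r, g, b) tuples, one per plane
    sigmas: list of non-negative densities, one per plane
    depths: list of increasing plane depths
    last_delta: assumed thickness of the slab behind the farthest plane
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    weight_sum = 0.0
    for i, (color, sigma, d) in enumerate(zip(colors, sigmas, depths)):
        # Distance to the next plane; the last plane uses last_delta.
        delta = depths[i + 1] - d if i + 1 < len(depths) else last_delta
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this slab
        weight = transmittance * alpha
        rgb = [acc + weight * ch for acc, ch in zip(rgb, color)]
        depth += weight * d
        weight_sum += weight
        transmittance *= 1.0 - alpha
    return rgb, depth, weight_sum
```

In a self-supervised setup of the kind the abstract describes, the composited color would be compared against a real frame through a photometric loss, and the composited depth is obtained from the same weights without depth annotations.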