SphereFusion: Efficient Panorama Depth Estimation via Gated Fusion

Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, Fei Deng

TL;DR

SphereFusion tackles panorama depth estimation by jointly leveraging the equirectangular and spherical projections. It introduces GateFuse, a gated feature fusion mechanism, and employs a cache-enabled mesh encoder–decoder to estimate depth directly in the spherical domain, balancing detail capture with efficiency. Across three public panorama datasets, the method achieves competitive accuracy while delivering state-of-the-art inference speed (17 ms for a $512 \times 1024$ panorama), reaching real-time performance on high-end GPUs. This makes it suitable for panorama sensing in robotics and autonomous systems, where both accuracy and speed are critical.

Abstract

Due to the rapid development of panorama cameras, the task of estimating panorama depth has attracted significant attention from the computer vision community, especially in applications such as robot sensing and autonomous driving. However, existing methods that rely on different projection formats often encounter challenges: they either struggle with distortion and discontinuity, in the case of the equirectangular, cubemap, and tangent projections, or lose texture details, in the case of the spherical projection. To tackle these concerns, we present SphereFusion, an end-to-end framework that combines the strengths of various projection methods. Specifically, SphereFusion first employs 2D image convolution and mesh operations to extract two distinct types of features from the panorama image in the equirectangular and spherical projection domains. These features are then projected onto the spherical domain, where a gate fusion module selects the most reliable features for fusion. Finally, SphereFusion estimates panorama depth within the spherical domain. In addition, SphereFusion employs a cache strategy to improve the efficiency of mesh operations. Extensive experiments on three public panorama datasets demonstrate that SphereFusion achieves results competitive with other state-of-the-art methods while offering the fastest inference speed, at only 17 ms on a $512 \times 1024$ panorama image.
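The gate fusion step described in the abstract can be illustrated with a minimal sketch. It assumes the two feature sets have already been brought into the spherical domain, so that each mesh face carries one spherical feature vector ($F_{sp}$) and one equirectangular feature vector ($F_{eq}$) of the same length. The function name `gate_fuse`, the per-channel weights `w_sp` and `w_eq`, and the `bias` term are hypothetical; this shows a generic sigmoid-gated blend, not the paper's actual GateFuse layers.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_fuse(f_sp, f_eq, w_sp, w_eq, bias=0.0):
    """Blend two feature vectors of one mesh face with a per-channel gate.

    f_sp, f_eq : features from the spherical and equirectangular branches.
    w_sp, w_eq : hypothetical gate weights (learned in a real network).
    The gate g lies in (0, 1): g near 1 keeps the spherical feature,
    g near 0 keeps the equirectangular feature.
    """
    if not (len(f_sp) == len(f_eq) == len(w_sp) == len(w_eq)):
        raise ValueError("all inputs must have the same number of channels")
    fused = []
    for s, e, ws, we in zip(f_sp, f_eq, w_sp, w_eq):
        g = sigmoid(ws * s + we * e + bias)   # gate sees both features
        fused.append(g * s + (1.0 - g) * e)   # convex combination
    return fused


# With zero weights and zero bias the gate is 0.5, i.e. a plain average.
print(gate_fuse([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]))  # [2.0, 3.0]
```

Because the gate is computed from both inputs, the blend can favour either branch per channel, which is the property the abstract attributes to selecting "the most reliable features".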

Paper Structure

This paper contains 30 sections, 8 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Comparison with BiFuse wang2020bifuse, UniFuse jiang2021unifuse, SliceNet pintore2021slicenet, PanoFormer shen2022panoformer, OmniFusion li2022omnifusion, SphereDepth yan2022spheredepth, and HoHoNet sun2021hohonet on Stanford2D3D armeni2017joint at a resolution of $512 \times 1024$. The horizontal axis is FPS, and the vertical axis is $\delta(1.25)$ (%), the percentage of pixels for which the relative difference between the prediction and the ground truth is below the 1.25 threshold. Higher FPS and higher $\delta(1.25)$ are better.
  • Figure 2: Given a panorama image in the equirectangular projection and the spherical projection, SphereFusion simultaneously extracts features by a 2D image encoder and a mesh encoder, which follows the ResNet structure he2016deep, then fuses these features by the Gate Fusion module in the spherical projection, and finally estimates the depth map through the mesh decoder.
  • Figure 3: The ideal representation of a panorama image is the sphere, but it is impractical. The equirectangular projection is the most popular method, but it suffers from distortion at the poles and discontinuity at the borders. The spherical mesh can approximate the sphere, and their difference becomes smaller with higher MR.
  • Figure 4: Mesh operations include mesh convolution and mesh pooling/unpooling yan2022spheredepth, which rely on the relationship between triangles of the spherical mesh.
  • Figure 5: We implement BiFuse wang2020bifuse, UniFuse jiang2021unifuse, and our GateFuse to fuse features from the spherical projection $F_{sp}$ and the equirectangular projection $F_{eq}$. Unlike BiFuse and UniFuse, which select features from $F_{eq}$ and fuse them into $F_{sp}$, GateFuse selects features from both $F_{sp}$ and $F_{eq}$.
  • ...and 10 more figures
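
The $\delta(1.25)$ score plotted in Figure 1 can be made concrete with a short sketch. It assumes the threshold-accuracy definition commonly used in depth estimation, where a pixel counts as correct when $\max(\hat{d}/d, d/\hat{d}) < 1.25$; the paper's exact evaluation code is not shown here, and the function name `delta_accuracy` and the rule of skipping pixels with non-positive ground truth are illustrative choices.

```python
def delta_accuracy(pred, gt, threshold=1.25):
    """Percentage of valid pixels whose depth ratio is below `threshold`.

    pred, gt : flat sequences of predicted and ground-truth depths.
    Pixels with non-positive ground truth are treated as invalid and
    skipped (an assumption); non-positive predictions count as misses.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    hits = sum(
        1 for p, g in valid
        if p > 0 and max(p / g, g / p) < threshold
    )
    return 100.0 * hits / len(valid)


# Three valid pixels: ratios 1.0 and 1.2 pass, 4/3 fails, gt == 0 is skipped.
score = delta_accuracy([1.0, 2.0, 3.0, 4.0], [1.0, 2.4, 4.0, 0.0])
print(round(score, 2))  # 66.67
```

Using the symmetric ratio penalizes over- and under-estimation equally, which is why the score is reported as a single percentage per method.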