Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation

Mehdi Zayene, Jannik Endres, Albias Havolli, Charles Corbière, Salim Cherkaoui, Alexandre Kontouli, Alexandre Alahi

TL;DR

Helvipad provides a real-world omnidirectional stereo dataset to advance 360° depth estimation, offering pixel-wise labels derived from LiDAR projections and augmented density via depth completion. The authors adapt state-of-the-art stereo models to spherical geometry by incorporating a polar angle input and circular padding, introducing 360-IGEV-Stereo, which achieves superior performance on Helvipad. Comprehensive experiments show improved depth accuracy, boundary consistency, and cross-scene generalization, underscoring the dataset’s value for real-time navigation in indoor and outdoor human environments. The work establishes Helvipad as a robust testbed for developing and evaluating omnidirectional stereo methods and depth-perception pipelines.
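The two adaptations named above, circular padding and a polar angle input, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `circular_pad_width` and `polar_angle_map` are hypothetical, padding is assumed to wrap only along the image width (the longitude axis), and the image rows are assumed to span the full polar range from 0 to $\pi$.

```python
import numpy as np


def circular_pad_width(image: np.ndarray, pad: int) -> np.ndarray:
    """Wrap-pad an (H, W, C) equirectangular image along the width axis.

    Assumption: the left and right image borders are adjacent on the sphere,
    so padded columns are copied from the opposite side of the image.
    """
    return np.pad(image, ((0, 0), (pad, pad), (0, 0)), mode="wrap")


def polar_angle_map(height: int, width: int) -> np.ndarray:
    """Return an (H, W) map of polar angles in radians, constant along each row.

    Assumption: rows cover the full range (0, pi), sampled at pixel centers.
    """
    theta = (np.arange(height) + 0.5) / height * np.pi
    return np.repeat(theta[:, None], width, axis=1)


# Toy 2x4 single-channel image, padded by one column on each side.
image = np.arange(2 * 4 * 1, dtype=np.float32).reshape(2, 4, 1)
padded = circular_pad_width(image, pad=1)
# Polar angle map with the same spatial size as the padded image.
angles = polar_angle_map(*padded.shape[:2])
```

In this sketch the angle map is built at the padded resolution so that it can be stacked with the padded images; a dataset that covers only part of the vertical field of view would need a different angle range.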

Abstract

Despite progress in stereo depth estimation, omnidirectional imaging remains underexplored, mainly due to the lack of appropriate data. We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation, featuring 40K frames from video sequences across diverse environments, including crowded indoor and outdoor scenes with various lighting conditions. Collected using two 360° cameras in a top-bottom setup and a LiDAR sensor, the dataset includes accurate depth and disparity labels obtained by projecting 3D point clouds onto equirectangular images. Additionally, we provide an augmented training set with increased label density, obtained through depth completion. We benchmark leading stereo depth estimation models for both standard and omnidirectional images. The results show that while recent stereo methods perform decently, accurately estimating depth in omnidirectional imaging remains a challenge. To address this, we introduce necessary adaptations to stereo models, leading to improved performance.
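To make the label-generation step concrete, the sketch below projects a single 3D point onto an equirectangular image. It is an illustration under stated assumptions, not the dataset's actual pipeline: the point is assumed to be already expressed in the camera frame, with $z$ pointing up, the polar angle measured from $+z$, and the azimuth measured from $+x$. The function name `project_to_equirectangular` and the image size used in the example are hypothetical.

```python
import math


def project_to_equirectangular(x, y, z, width, height):
    """Map a 3D point in the camera frame to (u, v, depth).

    Assumed conventions (not taken from the paper):
      - polar angle theta in [0, pi], measured from the +z axis
      - azimuth phi in [0, 2*pi), measured from the +x axis
      - u grows with phi across the width, v grows with theta down the height
    """
    depth = math.sqrt(x * x + y * y + z * z)
    if depth == 0.0:
        raise ValueError("the camera center has no defined projection")
    theta = math.acos(z / depth)
    phi = math.atan2(y, x) % (2.0 * math.pi)
    u = (phi / (2.0 * math.pi) * width) % width  # wrap to stay inside [0, width)
    v = theta / math.pi * height
    return u, v, depth


# A point 2 m away on the horizon, a quarter turn from the +x axis.
u, v, depth = project_to_equirectangular(0.0, 2.0, 0.0, width=1920, height=960)
```

Each projected LiDAR point contributes one sparse depth label at pixel $(u, v)$, which is why the resulting labels are much sparser than the image and why a depth-completion step can increase label density.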

Paper Structure

This paper contains 50 sections, 14 equations, 19 figures, 14 tables, 1 algorithm.

Figures (19)

  • Figure 1: LiDAR to 360° image mapping and spherical disparity. The top and bottom cameras, separated by a baseline $B_{\text{cameras}}$, capture a shared projection point $P$, mapped to both image and LiDAR coordinates. Depth vectors $\vec{r}_{\text{top}}$ and $\vec{r}_{\text{bottom}}$ represent distances in each coordinate frame. The polar angles $\theta_b$ and $\theta_t$ represent the angles from the bottom and top cameras to $P$, respectively, while the angular disparity $d$ quantifies the angular difference between corresponding points in the two camera views.
  • Figure 2: Histograms of depth values, without depth completion, for the entire dataset and for the indoor and outdoor (day and night) sequences. Vertical dotted lines indicate the average depth for each setting. Depth values range from 0.5m to 225m, with averages of 8.1m overall, 5.4m for indoor scenes, and 9.2m for combined day and night outdoor scenes.
  • Figure 3: Overview of the 360-IGEV-Stereo architecture. The model takes as input the circularly padded top and bottom images together with a polar angle map of the same size. At the bottleneck of the feature network, the feature map is concatenated with the encoded polar angle at 1/32 of the original image size. The encoded polar angle is also concatenated with the context feature maps at 1/4 resolution. Subsequently, the Combined Geometry Encoding Volume (CGEV) is constructed by vertical warping. The iterative disparity refinement with the ConvGRU and the spatial upsampling are the same as in IGEV-Stereo (Xu et al., 2023).
  • Figure 4: Depth MARE comparison across different depth estimation methods trained on original vs. augmented depth labels.
  • Figure 5: Cross-scene generalization analysis of each model when trained on different subsets (Indoor, Outdoor, Indoor+Outdoor, All) and evaluated under various testing conditions (indoor, outdoor, night outdoor). We use the depth MARE for comparison.
  • ...and 14 more figures
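
The geometry in the Figure 1 caption can be turned into a depth formula with the law of sines. The following is a sketch under assumed conventions, which may differ from the paper's: polar angles are measured from the upward vertical axis, the top camera sits a distance $B_{\text{cameras}}$ directly above the bottom camera, and the angular disparity is taken as $d = \theta_t - \theta_b > 0$. In the triangle formed by the two cameras and $P$, the angle at $P$ equals $d$, which gives $r_{\text{bottom}} = B_{\text{cameras}} \sin\theta_t / \sin d$ and $r_{\text{top}} = B_{\text{cameras}} \sin\theta_b / \sin d$. The function name and the baseline value in the example are hypothetical.

```python
import math


def depth_from_angular_disparity(theta_bottom, disparity, baseline):
    """Return (r_bottom, r_top) for a top-bottom 360-degree stereo pair.

    Assumed conventions (not taken from the paper):
      - angles in radians, measured from the upward vertical axis
      - disparity = theta_top - theta_bottom, strictly positive
      - baseline is the vertical distance between the two camera centers
    """
    if disparity <= 0.0:
        raise ValueError("disparity must be positive for a finite depth")
    theta_top = theta_bottom + disparity
    # Law of sines in the triangle (bottom camera, top camera, P).
    r_bottom = baseline * math.sin(theta_top) / math.sin(disparity)
    r_top = baseline * math.sin(theta_bottom) / math.sin(disparity)
    return r_bottom, r_top


# Check against a point with known position: bottom camera at the origin,
# top camera 0.2 m above it (hypothetical baseline), P at (3, 0, 1).
baseline = 0.2
theta_b = math.atan2(3.0, 1.0)             # polar angle seen from the bottom camera
theta_t = math.atan2(3.0, 1.0 - baseline)  # polar angle seen from the top camera
r_bottom, r_top = depth_from_angular_disparity(theta_b, theta_t - theta_b, baseline)
```

Because the depth scales with $1 / \sin d$, small errors in the disparity of distant points translate into large depth errors, which is one reason relative metrics such as the depth MARE in Figures 4 and 5 are informative.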