Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model

Jannik Endres, Oliver Hahn, Charles Corbière, Simone Schaub-Meyer, Stefan Roth, Alexandre Alahi

TL;DR

DFI-OmniStereo targets accurate omnidirectional depth estimation from 360° imagery by integrating a large-scale pre-trained monocular relative depth foundation model into an iterative stereo matching framework. Training proceeds in two stages: Stage A adapts the stereo matching head to the omnidirectional data and the foundation model features while the foundation model stays frozen, and Stage B fine-tunes the foundation model decoder and the stereo matching head with a scale-invariant logarithmic loss. This schedule is intended to preserve the foundation model's generalization while adapting to omnidirectional data. On the real-world Helvipad dataset, the method achieves state-of-the-art disparity and depth performance, along with strong generalization and data efficiency. The work demonstrates the practical potential of foundation-model-guided stereo for robust 360° scene understanding in mobile robotics.

Abstract

Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
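The abstract reports its headline result as a relative reduction in disparity MAE. As a minimal sketch of how such a figure is typically computed, the snippet below evaluates a mean absolute error over valid pixels and the relative reduction between two MAE values. The function names, the `valid` mask, and all numbers are illustrative assumptions; none of them come from the paper or its evaluation code.

```python
from typing import Sequence


def masked_mae(pred: Sequence[float], target: Sequence[float],
               valid: Sequence[bool]) -> float:
    """Mean absolute error over entries whose `valid` flag is set.

    `valid` is a hypothetical mask marking pixels that have ground truth.
    """
    if not (len(pred) == len(target) == len(valid)):
        raise ValueError("pred, target, and valid must have the same length")
    errors = [abs(p - t) for p, t, v in zip(pred, target, valid) if v]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


# Toy values for illustration only (not results from the paper).
pred_disp = [1.0, 2.0, 5.0]
gt_disp = [1.5, 2.0, 9.0]
valid = [True, True, False]  # the third pixel has no ground truth

print(masked_mae(pred_disp, gt_disp, valid))  # 0.25
print(relative_reduction(2.0, 1.5))           # 0.25, i.e. a 25% reduction
```

Under this definition, a reduction of approximately 16% means the new method's MAE is roughly 0.84 times that of the baseline.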

Paper Structure

This paper contains 28 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of our proposed omnidirectional stereo matching approach DFI-OmniStereo. Given a pair of equirectangular images captured by two vertically stacked omnidirectional cameras, our method integrates a large-scale pre-trained monocular relative depth foundation model into an iterative stereo matching approach. DFI-OmniStereo improves disparity and depth estimation accuracy, significantly outperforming the previous state-of-the-art method. We visualize predicted disparity on a log scale (red indicates high disparity and low depth; blue indicates low disparity and high depth).
  • Figure 2: Overview of DFI-OmniStereo. A shared depth foundation model (purple) is utilized to extract representations from a top and bottom image. Subsequently, an omnidirectional stereo matching head (pink) predicts disparity, utilizing the image features as follows: The intermediate representations and relative depth maps of both images are adapted to be processed as multi-scale feature maps by the iterative matching head. This head predicts a disparity map using vertical warping for cost volume construction. The training consists of two stages. In training stage A (blue), we adapt the stereo matching head to the omnidirectional data and the foundation model features (foundation model frozen) using a conventional stereo matching loss $\mathcal{L}_{A}$. In stage B (orange), we fine-tune the foundation model decoder and the stereo matching head, utilizing a scale-invariant logarithmic loss $\mathcal{L}_{B}$. Frozen and trainable modules are denoted with a snowflake and fire symbol, respectively.
  • Figure 3: Qualitative comparison on the Helvipad [zayene2024helvipad] test split. We visualize the bottom image, ground-truth disparity maps (°), and the predicted disparity maps (°) of the previous state-of-the-art method, 360-IGEV-Stereo, and of DFI-OmniStereo.
  • Figure 4: Sample-efficiency analysis of DFI-OmniStereo on the Helvipad dataset [zayene2024helvipad]. Our method is trained on randomly sampled subsets of the training data. For comparison, 360-IGEV-Stereo [zayene2024helvipad] trained on $100\%$ of the training data is shown as a dashed line.
  • Figure 5: Qualitative comparison of generalization to real-world images from [wang2020360sd]. We visualize the bottom image and the disparity predictions (°) of 360-IGEV-Stereo and DFI-OmniStereo (from top to bottom) for the hall, room, and stairs scenes (from left to right).
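
The Figure 2 caption names a scale-invariant logarithmic loss $\mathcal{L}_{B}$ for stage B, but its exact form is not given in this summary. As a rough illustration of the general idea only, the sketch below implements one commonly used scale-invariant log error (mean squared log difference minus a weighted squared mean log difference). The weight `lam`, the function name, and the handling of invalid values are assumptions and may differ from the loss used in the paper.

```python
import math
from typing import Sequence


def silog_loss(pred: Sequence[float], target: Sequence[float],
               lam: float = 0.5) -> float:
    """Scale-invariant log error: mean(g^2) - lam * mean(g)^2,
    where g = log(pred) - log(target).

    Pairs with a non-positive value are skipped (an assumption for
    handling missing ground truth). With lam = 1.0 the result does not
    change when `pred` is multiplied by a global scale factor.
    """
    diffs = [math.log(p) - math.log(t)
             for p, t in zip(pred, target) if p > 0 and t > 0]
    if not diffs:
        raise ValueError("no valid (positive) prediction/target pairs")
    n = len(diffs)
    mean_sq = sum(g * g for g in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean


# Toy values for illustration only (not data from the paper).
pred = [1.0, 2.0, 4.0]
target = [1.5, 2.5, 3.0]

loss = silog_loss(pred, target, lam=1.0)
loss_scaled = silog_loss([2.0 * p for p in pred], target, lam=1.0)
print(math.isclose(loss, loss_scaled, abs_tol=1e-9))  # True: a global scale cancels
```

With `lam = 1.0`, a global scale error in the prediction adds the same constant to every log difference and cancels out of the loss. Values of `lam` below 1 penalize such a scale error only partially.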