
Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization

Lehuai Xu, Weiming Zhang, Yang Li, Sidan Du, Lin Wang

Abstract

Reliable omnidirectional depth estimation from multi-fisheye stereo matching is pivotal to many applications, such as embodied robotics. Existing approaches either rely on spherical sweeping with heuristic fusion strategies to build the cost columns or perform reference-centric stereo matching based on rectified views. However, these methods fail to explicitly exploit geometric relationships between multiple views, rendering them less capable of capturing global dependencies, visibility, or scale changes. In this paper, we shift to a new perspective and propose a novel reference-free framework, dubbed FreeOmniMVS, via multi-view consistency maximization. The highlight of FreeOmniMVS is that it aggregates pairwise correlations into a robust, visibility-aware, global consensus. As such, it is tolerant of occlusions, partial overlaps, and varying baselines. Specifically, to achieve global coherence, we introduce a novel View-pair Correlation Transformer (VCT) that explicitly models pairwise correlation volumes across all camera view pairs, allowing us to drop unreliable pairs caused by occlusion or out-of-focus observations. To realize scalable and visibility-aware consensus, we propose a lightweight attention mechanism that adaptively fuses the correlation vectors, eliminating the need for a designated reference view and allowing all cameras to contribute equally to the stereo matching process. Extensive experiments on diverse benchmark datasets demonstrate the superiority of our method for globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.

Paper Structure (16 sections, 5 equations, 5 figures, 3 tables)

This paper contains 16 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a,c) Fisheye inputs with randomly injected local blur and noise, simulating out-of-focus or corrupted observations. (b,d) Depth predictions from our backbone (top) versus FreeOmniMVS (bottom). Our method suppresses artifacts around degraded regions and preserves thin structures and cross-camera consistency across the full $360^\circ$ field of view.
  • Figure 2: Overall architecture of FreeOmniMVS. Given four fisheye images, Stage 1 extracts multi-scale unary features and performs OmniMVS-style spherical sweeping to build per-view feature volumes on the ERP sphere. Stage 2 applies the proposed VCT to construct pairwise correlation volumes over all camera combinations and aggregates them, via sparse Top-$k$ attention, into a visibility-aware consistency volume $\mathcal{C}_{\text{fused}}$. Stage 3 uses a lightweight context fuser to build a global context volume and feeds both context and consistency features into a RAFT-Stereo-style recurrent updater, which starts from a zero inverse-depth map and iteratively refines it, followed by convex upsampling to obtain the final high-resolution omnidirectional depth.
  • Figure 3: Overview of the proposed VCT. For each voxel $(d,\mathbf{x})$, VCT takes the vector of pairwise correlations across all camera pairs, builds a self-similarity matrix over view pairs, and applies attention with Top-$k$ sparsification, with Gumbel perturbation during training, to aggregate them into a fused scalar consistency score.
  • Figure 4: Qualitative comparison on clean test images from OmniThings, OmniHouse, and Sunny.
  • Figure 5: Corner cases on occlusion-augmented OmniHouse and Sunny. (a,c) Fisheye inputs with strong local blur or noise. (b,d) Depth predictions from RomniStereo$_{32}$-ft (top) and FreeOmniMVS-ft (bottom). Our method better preserves thin structures and reduces large local errors around degraded regions.
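
The fusion step described in the captions for Figures 2 and 3 can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`pairwise_correlations`, `topk_attention_fuse`), the dot-product correlation, the outer-product self-similarity, the row-mean scoring rule, and the values of `k` and `tau` are all assumptions chosen for illustration. The sketch only shows the general idea: for each voxel, score every camera pair by how well it agrees with the other pairs, keep the Top-$k$ pairs, and fuse their correlations with softmax weights, optionally adding Gumbel noise to the scores as a stand-in for the training-time perturbation.

```python
import itertools
import numpy as np


def pairwise_correlations(feats):
    """Correlate every unordered pair of per-view feature volumes.

    feats: array of shape (V, C, D, H, W), one feature volume per camera.
    Returns (corr, pairs) where corr has shape (P, D, H, W), P = V*(V-1)/2.
    The scaled dot product used here is an assumed correlation measure.
    """
    num_views, channels = feats.shape[:2]
    pairs = list(itertools.combinations(range(num_views), 2))
    corr = np.stack(
        [(feats[i] * feats[j]).sum(axis=0) / np.sqrt(channels) for i, j in pairs]
    )
    return corr, pairs


def topk_attention_fuse(corr, k=3, tau=1.0, rng=None):
    """Fuse P pairwise correlations per voxel into one consistency score.

    corr: array of shape (P, D, H, W).
    rng:  if given, Gumbel(0, 1) noise is added to the scores (training mode).
    Returns (fused, weights) with shapes (D, H, W) and (P, D, H, W).
    """
    num_pairs = corr.shape[0]
    c = corr.reshape(num_pairs, -1).T              # (N, P): one vector per voxel
    # Assumed self-similarity over view pairs: outer product of the vector.
    sim = c[:, :, None] * c[:, None, :]            # (N, P, P)
    logits = sim.mean(axis=2) / tau                # (N, P): agreement per pair
    if rng is not None:
        u = rng.uniform(1e-9, 1.0, size=logits.shape)
        logits = logits - np.log(-np.log(u))       # Gumbel perturbation
    # Top-k sparsification: mask all but the k largest scores per voxel.
    kth = np.partition(logits, num_pairs - k, axis=1)[:, num_pairs - k][:, None]
    masked = np.where(logits >= kth, logits, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    w = np.exp(masked)
    w = w / w.sum(axis=1, keepdims=True)           # softmax over surviving pairs
    fused = (w * c).sum(axis=1)
    return fused.reshape(corr.shape[1:]), w.T.reshape(corr.shape)


# Toy example: four cameras, so six view pairs, on a tiny (D, H, W) grid.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 5, 6, 7))
corr, pairs = pairwise_correlations(feats)
fused, weights = topk_attention_fuse(corr, k=3)
```

Because the weights are non-negative and sum to one per voxel, the fused score is a convex combination of the retained pairwise correlations, and no single camera acts as a reference view. Passing `rng=rng` to `topk_attention_fuse` enables the noisy, training-style selection.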