Real-time Multi-view Omnidirectional Depth Estimation for Real Scenarios based on Teacher-Student Learning with Unlabeled Data

Ming Li, Xiong Yang, Chaofan Wu, Jiaheng Li, Pinzhi Wang, Xuejiao Hu, Sidan Du, Yang Li

TL;DR

The paper tackles real-time omnidirectional depth estimation for edge devices while ensuring cross-scene generalization. It introduces Rt-OmniMVS, a lightweight framework built around Combined Spherical Sweeping and a 2D CNN-based cost aggregation, coupled with a teacher-student training regime that leverages unlabeled real-world data through pseudo-labels from a state-of-the-art stereo model. To support real-world validation, the authors present HexaMODE, a six-fisheye camera system on an edge computer, and Hexa360Depth, a large hybrid dataset with synthetic and real data. Experiments show Rt-OmniMVS achieves competitive accuracy with significantly improved real-time efficiency (≥15 fps) on edge hardware, along with strong generalization across indoor and outdoor scenarios. This work advances practical real-time 360° depth perception for autonomous driving and robotics by combining algorithmic efficiency with unlabeled-data training and real-world data collection.

Abstract

Omnidirectional depth estimation enables efficient 3D perception over a full 360-degree range. However, in real-world applications such as autonomous driving and robotics, achieving real-time performance and robust cross-scene generalization remains a significant challenge for existing algorithms. In this paper, we propose Rt-OmniMVS, a real-time omnidirectional depth estimation method for edge computing platforms, which introduces the Combined Spherical Sweeping method and adopts a lightweight network structure to achieve real-time performance. To achieve high accuracy, robustness, and generalization in real-world environments, we introduce a teacher-student learning strategy. We use a high-precision stereo matching method as the teacher model to predict pseudo labels for unlabeled real-world data, and apply data and model augmentation techniques during training to enhance the performance of the student model, Rt-OmniMVS. We also propose HexaMODE, an omnidirectional depth sensing system based on multi-view fisheye cameras and an edge computing device. A large-scale hybrid dataset containing both unlabeled real-world data and synthetic data is collected for model training. Experiments on public datasets demonstrate that the proposed method achieves results comparable to state-of-the-art approaches while consuming significantly fewer resources. The proposed system and algorithm also demonstrate high accuracy in various complex real-world scenarios, both indoors and outdoors, achieving an inference speed of 15 frames per second on edge computing platforms.
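
The pseudo-label supervision mentioned in the abstract can be illustrated with a minimal sketch. It assumes the student is penalized by a mean absolute error against the teacher's prediction, restricted to pixels where the teacher label is considered valid. The function name `pseudo_label_loss`, the use of inverse depth, and the L1 penalty are illustrative assumptions, not the loss actually used in the paper.

```python
import numpy as np

def pseudo_label_loss(student_inv_depth, teacher_inv_depth, valid_mask):
    """Masked mean absolute error between student output and teacher pseudo-label.

    All three arrays share the same shape. `valid_mask` is nonzero where the
    teacher's pseudo-label is trusted (hypothetical validity criterion).
    """
    valid = valid_mask.astype(bool)
    if not valid.any():
        # No trusted pixels: contribute nothing to the training objective.
        return 0.0
    abs_error = np.abs(student_inv_depth - teacher_inv_depth)
    return float(abs_error[valid].mean())
```

Masking matters here because stitched pseudo-labels may have regions with no reliable stereo match; averaging only over valid pixels keeps those regions from dominating the loss.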

Paper Structure (18 sections, 1 equation, 9 figures, 8 tables)


Figures (9)

  • Figure 1: Comparison of recent multi-view omnidirectional depth estimation methods in terms of accuracy and inference time. The proposed Rt-OmniMVS achieves competitive accuracy with the fastest inference speed
  • Figure 2: The proposed Combined Spherical Sweeping compared with the conventional method. The conventional method, indicated by the gray arrows, projects the feature map of every input image onto the complete sphere and then stitches the features to construct two spherical features. The proposed method, illustrated by the red arrows, directly projects the multi-view features into two spherical features, significantly reducing the computational cost of the projection process
  • Figure 3: The model structure of the proposed Rt-OmniMVS. The method uses Combined Spherical Sweeping to construct omnidirectional matching costs from the features of multi-view fisheye images, followed by cost aggregation to predict depth. Random rotation is applied to improve performance. The model employs a lightweight structural design and multi-scale supervision
  • Figure 4: Diagram of the proposed pseudo depth generation method. Multi-view fisheye images are projected into pinhole stereo image pairs in various directions, depth maps are obtained from them by stereo matching, and the depth maps are stitched together to construct an omnidirectional depth map. (a) presents the process of image projection and the generation of pinhole stereo pairs. (b) and (c) show the pseudo-label generation process for camera systems with four and six fisheye cameras as input, respectively
  • Figure 5: Diagram of the proposed teacher-student learning strategy. The student model is first pre-trained on the public synthetic dataset and then trained with pseudo-labels inferred by the teacher model, while data and model augmentation are applied to enhance performance
  • ...and 4 more figures
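
As a rough illustration of the sweeping step described in the Figure 2 and Figure 3 captions, the sketch below computes, for one depth hypothesis, where each pixel of a reference sphere lands in one fisheye image. Several things here are assumptions made for illustration and are not taken from the paper: the equirectangular parameterization of the sphere, the equidistant fisheye model (r = f·θ), and the names `sphere_rays`, `sweep_to_fisheye`, `R`, `t`, `focal`, `cx`, `cy`, and `max_theta`. It shows generic spherical sweeping only and does not reproduce how the Combined variant assigns cameras to the two spherical features.

```python
import numpy as np

def sphere_rays(height, width):
    """Unit ray directions for an equirectangular grid, shape (height, width, 3)."""
    # Longitude in [-pi, pi) and latitude in (-pi/2, pi/2), sampled at pixel centres.
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2.0
    lon, lat = np.meshgrid(lon, lat)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def sweep_to_fisheye(rays, inv_depth, R, t, focal, cx, cy, max_theta):
    """Project sphere points at one inverse-depth hypothesis into a fisheye image.

    R (3x3) and t (3,) map rig coordinates to camera coordinates:
    p_cam = R @ p_rig + t. Returns pixel coordinates (..., 2) and a boolean
    mask marking points inside the camera's field of view.
    """
    points = rays / inv_depth                      # sphere of radius 1 / inv_depth
    cam = points @ R.T + t                         # rig frame -> camera frame
    norm = np.maximum(np.linalg.norm(cam, axis=-1), 1e-9)
    theta = np.arccos(np.clip(cam[..., 2] / norm, -1.0, 1.0))  # angle from optical axis
    phi = np.arctan2(cam[..., 1], cam[..., 0])     # azimuth around the optical axis
    r = focal * theta                              # equidistant fisheye model (assumed)
    uv = np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)
    return uv, theta <= max_theta
```

In a full pipeline, the returned coordinates would be used to sample each camera's feature map once per depth hypothesis, and the sampled features from different cameras would be compared to build the matching cost. The validity mask indicates which sphere pixels a given camera can actually observe.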