Cross-Domain Synthetic-to-Real In-the-Wild Depth and Normal Estimation for 3D Scene Understanding
Jay Bhanushali, Manivannan Muniyandi, Praneeth Chakravarthula
TL;DR
This work tackles cross-domain depth and normal estimation for outdoor omnidirectional imagery by training on a synthetic dataset (OmniHorizon) and applying to real-world scenes. It introduces UBotNet, a hybrid UNet-Bottleneck Transformer that captures both local details and global context for consistent depth and normals, along with a lighter UBotNet Lite version. The OmniHorizon dataset provides rich dynamic outdoor environments with varied lighting and participants to support robust cross-domain learning, and the approach demonstrates strong sim-to-real transfer and real-world performance, while also outlining limitations and avenues for improvement. Overall, the combination of synthetic outdoor data and the UBotNet architecture advances reliable monocular omnidirectional scene understanding with practical implications for AR/VR, SLAM, and autonomous perception.
Abstract
We present a cross-domain inference technique that learns from synthetic data to estimate depth and normals for in-the-wild omnidirectional 3D scenes encountered in real-world uncontrolled settings. To this end, we introduce UBotNet, an architecture that combines UNet and Bottleneck Transformer elements to predict consistent scene normals and depth. We also introduce the OmniHorizon synthetic dataset containing 24,335 omnidirectional images that represent a wide variety of outdoor environments, including buildings, streets, and diverse vegetation. This dataset is generated from expansive, lifelike virtual spaces and encompasses dynamic scene elements, such as changing lighting conditions, different times of day, pedestrians, and vehicles. Our experiments show that UBotNet achieves significantly improved accuracy in depth estimation and normal estimation compared to existing models. Lastly, we validate cross-domain synthetic-to-real depth and normal estimation on real outdoor images using UBotNet trained solely on our synthetic OmniHorizon dataset, demonstrating the potential of both the synthetic dataset and the proposed network for real-world scene understanding applications.
