Cross-Domain Synthetic-to-Real In-the-Wild Depth and Normal Estimation for 3D Scene Understanding

Jay Bhanushali; Manivannan Muniyandi; Praneeth Chakravarthula

TL;DR

This work tackles cross-domain depth and normal estimation for outdoor omnidirectional imagery by training on a synthetic dataset (OmniHorizon) and applying the trained model to real-world scenes. It introduces UBotNet, a hybrid of UNet and Bottleneck Transformer that captures both local detail and global context to predict consistent depth and normals, along with a lighter variant, UBotNet Lite. The OmniHorizon dataset provides 24,335 omnidirectional images of dynamic outdoor environments with varied lighting, pedestrians, and vehicles to support cross-domain learning. UBotNet trained solely on OmniHorizon is validated on real outdoor images, and the paper also outlines limitations and avenues for improvement. The authors position the combination of synthetic outdoor data and the UBotNet architecture as a step toward reliable monocular omnidirectional scene understanding, with practical implications for AR/VR, SLAM, and autonomous perception.

Abstract

We present a cross-domain inference technique that learns from synthetic data to estimate depth and normals for in-the-wild omnidirectional 3D scenes encountered in real-world uncontrolled settings. To this end, we introduce UBotNet, an architecture that combines UNet and Bottleneck Transformer elements to predict consistent scene normals and depth. We also introduce the OmniHorizon synthetic dataset containing 24,335 omnidirectional images that represent a wide variety of outdoor environments, including buildings, streets, and diverse vegetation. This dataset is generated from expansive, lifelike virtual spaces and encompasses dynamic scene elements, such as changing lighting conditions, different times of day, pedestrians, and vehicles. Our experiments show that UBotNet achieves significantly improved accuracy in depth estimation and normal estimation compared to existing models. Lastly, we validate cross-domain synthetic-to-real depth and normal estimation on real outdoor images using UBotNet trained solely on our synthetic OmniHorizon dataset, demonstrating the potential of both the synthetic dataset and the proposed network for real-world scene understanding applications.

Paper Structure (29 sections, 1 equation, 19 figures, 7 tables)

Introduction
Related Work
Real Datasets
Synthetic Datasets
Monocular Omnidirectional Depth and Normals
Dataset
Scene Attributes
Dynamic Lighting
Dynamic Scene Participants
Neural Cross-domain Inference
UBotNet Architecture
Network Training and Experiments
Discussion and Evaluation
Benchmark Results on OmniHorizon
Ablation Study
...and 14 more sections

Figures (19)

Figure 1: Synthetic-to-real cross-domain inference. UBotNet, trained on the proposed synthetic OmniHorizon dataset, performs cross-domain inference of scene-consistent depth and normals on real-world images captured outdoors in the wild.
Figure 2: Overview of the OmniHorizon dataset. Our dataset models urban areas, vegetation, and various outdoor components, along with pedestrians and vehicles. The depth distribution varies across scenes, as visualized.
Figure 3: Dynamic lighting and varying time-of-day settings. a) The lighting of the scene is varied by modulating the directional light (sun) and the secondary light source (skylight). b) Changes in scene lighting conditions achieved by modulating the light sources.
Figure 4: Examples of pedestrians in the OmniHorizon dataset. a) Virtual avatars sitting in a cafeteria, b) a pedestrian walking on the street (spline path highlighted in pink), and c) a casual group hangout.
Figure 5: Proposed UBotNet architecture. UBotNet is a hybrid architecture based on UNet and Bottleneck Transformer (BoTNet). Anti-aliased max pooling is used for the pooling operation. The transformer block is placed between the encoder and decoder paths of the UNet. UBotNet Lite uses separable convolutions in place of standard convolution layers; otherwise, it is identical to UBotNet. A simplified illustration of BoTNet, which contains Multi-Head Self-Attention (MHSA) for learning global context, is also shown.
...and 14 more figures
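
The Figure 5 caption notes that UBotNet Lite swaps standard convolutions for separable ones. As a rough illustration of why that makes a network lighter, the sketch below compares weight counts for a single layer. It assumes the separable layer is a depthwise k x k convolution followed by a 1 x 1 pointwise convolution, which is the common form but is not confirmed by this page. The function names and the layer sizes (64 input channels, 128 output channels, 3 x 3 kernel) are hypothetical and not taken from the paper.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameter count of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def separable_conv_params(c_in, c_out, k, bias=True):
    """Parameter count of a depthwise k x k convolution
    followed by a 1 x 1 pointwise convolution (assumed form)."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
standard = conv_params(c_in, c_out, k)
separable = separable_conv_params(c_in, c_out, k)

print(f"standard:  {standard} parameters")   # 73856
print(f"separable: {separable} parameters")  # 8960
print(f"reduction: {standard / separable:.2f}x")
```

For this hypothetical layer the separable form needs roughly an eighth of the parameters. The actual savings in UBotNet Lite depend on its layer configuration, which is not given here.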
