Towards Robust Monocular Depth Estimation in Non-Lambertian Surfaces

Junrui Zhang, Jiaqi Li, Yachuan Huang, Yiran Wang, Jinghong Zheng, Liao Shen, Zhiguo Cao

TL;DR

The paper tackles the robustness gap of monocular depth estimation on non-Lambertian surfaces, i.e. transparent or mirror (ToM) regions, by introducing a training framework that supervises depth predictions in the gradient domain using non-Lambertian surface regional cues, paired with random tone-mapping augmentation to simulate diverse lighting. A further contribution is an optional lighting fusion module based on a Variational Autoencoder that fuses multi-exposure images, letting the model work from the most favorable lighting for depth estimation. Trained on Hypersim with the ToM-aware loss $\mathcal{L}_{ToM}$ and a scale-shift-invariant loss $\mathcal{L}_{ssi}$, the approach improves zero-shot accuracy on Booster and Mirror3D NYU data (33.39% and 5.21% gains, respectively, over Depth Anything V2) and reaches a state-of-the-art delta1.05 of 90.75 in ToM regions on the TRICKY2024 competition test set. The work shows that gradient-domain supervision and lighting-aware data augmentation substantially improve ToM-region depth accuracy, though it notes limitations under extreme image corruption and overexposure.
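As a rough illustration of what random tone-mapping augmentation could look like, the sketch below applies a random exposure gain and gamma curve to an RGB image. The gain/gamma parameterisation, the ranges, and the function name `random_tone_map` are assumptions made for this example; the summary does not state which tone-mapping operator the paper uses.

```python
import numpy as np

def random_tone_map(image, rng, gamma_range=(0.5, 2.0), gain_range=(0.7, 1.3)):
    """Apply a random gain and gamma curve to an image with values in [0, 1].

    The operator and the ranges are illustrative assumptions,
    not values taken from the paper.
    """
    gamma = rng.uniform(*gamma_range)
    gain = rng.uniform(*gain_range)
    # Gain rescales exposure; clipping keeps values in the valid range
    # before the gamma curve bends the tones.
    return np.clip(gain * image, 0.0, 1.0) ** gamma

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))          # stand-in for a training RGB image
augmented = random_tone_map(image, rng)
```

Because both the gain and the gamma curve are monotonic, the augmentation changes apparent lighting without reordering pixel intensities, so the ground-truth depth map can be reused unchanged.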

Abstract

In monocular depth estimation (MDE), many models with excellent zero-shot performance in general scenes have emerged recently. However, these methods often fail on non-Lambertian surfaces, such as transparent or mirror (ToM) surfaces, because of the unique reflective properties of these regions. Previous methods rely on externally provided ToM masks and try to obtain correct depth maps by directly in-painting the RGB images. These methods depend heavily on the accuracy of the additional input masks, and their use of random colors during in-painting makes them insufficiently robust. We instead aim to incrementally enable the baseline model to learn the distinctive properties of non-Lambertian surface regions directly, through a well-designed training framework. To this end, we propose non-Lambertian surface regional guidance, which constrains the predictions of the MDE model in the gradient domain to enhance its robustness. Noting the significant impact of lighting on this task, we employ random tone-mapping augmentation during training so that the network predicts correct results under varying lighting. Additionally, we propose an optional lighting fusion module, which uses Variational Autoencoders to fuse multiple images and obtain the most advantageous input RGB image for depth estimation when multi-exposure images are available. Compared to Depth Anything V2, our method achieves accuracy improvements of 33.39% and 5.21% in zero-shot testing on the Booster and Mirror3D datasets, respectively, for non-Lambertian surfaces. The state-of-the-art performance of 90.75 in delta1.05 within the ToM regions on the TRICKY2024 competition test set demonstrates the effectiveness of our approach.
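For readers unfamiliar with the delta1.05 figure, the sketch below computes a threshold accuracy restricted to a region mask, assuming the usual definition: the percentage of pixels whose ratio max(pred/gt, gt/pred) is below 1.05. The function name, the toy arrays, and the omission of any scale/shift alignment step are assumptions of this example; the benchmark's exact evaluation protocol is not given in this text.

```python
import numpy as np

def delta_accuracy(pred, gt, mask, threshold=1.05):
    """Percentage of masked pixels with max(pred/gt, gt/pred) below threshold."""
    pred = pred[mask]
    gt = gt[mask]
    ratio = np.maximum(pred / gt, gt / pred)
    return 100.0 * np.mean(ratio < threshold)

# Toy data: three pixels lie inside the (hypothetical) ToM mask.
gt = np.array([[1.0, 2.0], [4.0, 5.0]])
pred = np.array([[1.02, 2.5], [4.1, 9.0]])
tom_mask = np.array([[True, True], [True, False]])
score = delta_accuracy(pred, gt, tom_mask)
```

In this toy case two of the three masked pixels fall within the 5% ratio bound, and the badly wrong pixel outside the mask does not affect the score.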

Paper Structure (17 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Visual comparison of the baseline and our method. Fine-tuned from Depth Anything V2, our method improves zero-shot performance on non-Lambertian surfaces.
  • Figure 2: Overview of our pipeline. We train the large version of Depth Anything V2 on the Hypersim dataset with random tone-mapping augmentation. Non-Lambertian regional guidance computes the loss between the predicted depth and the ground truth $D^\star$ in the gradient domain, restricted to the non-Lambertian regions $M$, and optimizes it. Image fusion is optional at inference, for cases where multi-exposure images are available.
  • Figure 3: Visual comparison of the baseline and our method. The first row consists of images from the NYU Depth Dataset V2. The second and third rows are the monocular depth estimation results from the baseline and our method, respectively. Our method outperforms the baseline because we explicitly and directly use the semantics of non-Lambertian regions to guide the depth estimation.
  • Figure 4: Visual comparison of other baselines and our method. Zero-shot evaluations on the Booster dataset show that our method outperforms the other baselines in most cases, as we explicitly and directly use non-Lambertian regional guidance.
  • Figure 5: Visual results of ablation study on random tone-mapping augmentation and non-Lambertian surface regional guidance. Results consist of two scenes. For each scene, the RGB image is placed in the upper left, the depth estimated by the baseline is placed in the upper right, the depth estimated by our model trained using only random tone-mapping augmentation is placed in the lower left, and the depth estimated by our model trained using random tone-mapping augmentation and ToM surface regional guidance is placed in the lower right.
  • ...and 1 more figures
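
The Figure 2 caption describes a loss computed in the gradient domain over the non-Lambertian regions $M$. A minimal sketch of one such loss follows, assuming forward-difference gradients and an L1 penalty; the paper's actual $\mathcal{L}_{ToM}$ is not reproduced in this text, so the function `masked_gradient_loss` and its details are hypothetical.

```python
import numpy as np

def masked_gradient_loss(pred, gt, mask):
    """Mean absolute difference of depth gradients inside a boolean region mask.

    Forward differences and the L1 penalty are assumptions of this sketch,
    not the paper's stated formulation.
    """
    diff = pred - gt
    # Gradient of the difference equals the difference of the gradients.
    gx = np.abs(diff[:, 1:] - diff[:, :-1])   # horizontal neighbours
    gy = np.abs(diff[1:, :] - diff[:-1, :])   # vertical neighbours
    # A gradient counts only if both pixels it spans lie inside the region.
    mx = mask[:, 1:] & mask[:, :-1]
    my = mask[1:, :] & mask[:-1, :]
    count = mx.sum() + my.sum()
    if count == 0:
        return 0.0
    return float((gx[mx].sum() + gy[my].sum()) / count)

gt = np.tile(np.arange(4.0), (4, 1))      # depth ramp as stand-in ground truth D*
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                      # stand-in non-Lambertian region M

loss_offset = masked_gradient_loss(gt + 3.0, gt, mask)   # constant depth offset
pred_bad = gt.copy()
pred_bad[1, 1] += 1.0                      # local error inside the region
loss_bad = masked_gradient_loss(pred_bad, gt, mask)
```

A constant depth offset leaves the gradients unchanged and is not penalised, whereas a local error inside the region is. This is one reason a gradient-domain term is typically paired with a loss on depth values themselves, such as the scale-shift-invariant loss mentioned in the TL;DR.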