Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning
Lintao Xu, Yinghao Wang, Chaohui Wang
TL;DR
This work introduces MoDOT, a two-stage, multi-task framework that jointly estimates occlusion boundaries (OBs) and monocular depth by enforcing their geometric coupling. Central to MoDOT are the Cross-Attention Strip Module (CASM), which performs cross-modal feature fusion, and the OB-Depth Constraint Loss (OBDCL), which aligns depth discontinuities with OBs. The model is trained on the synthetic OB-Hypersim dataset and evaluated on OB-FUTURE, NYUD-v2, and zero-shot real-world scenes. It achieves state-of-the-art depth and OB accuracy across synthetic and real datasets and generalizes well without domain adaptation. The work highlights the mutual benefits of integrating OBs and depth through carefully designed architecture and loss functions. It also notes limitations caused by the synthetic-to-real domain gap and suggests directions for improving cross-domain OB generalization.
Abstract
Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects. This task is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as Occlusion Boundaries (OBs) provide critical geometric cues for resolving depth ambiguities, while depth can conversely refine occlusion reasoning. In this paper, we aim to systematically model and exploit this mutually beneficial relationship. To this end, we propose MoDOT, a novel framework for joint estimation of depth and OBs, which incorporates a new Cross-Attention Strip Module (CASM) to leverage mid-level OB features for depth prediction, and a novel OB-Depth Constraint Loss (OBDCL) to enforce geometric consistency. To facilitate this study, we contribute OB-Hypersim, a large-scale photorealistic dataset with precise depth and self-occlusion-handled OB annotations. Extensive experiments on two synthetic datasets and NYUD-v2 demonstrate that MoDOT achieves significantly better performance than single-task baselines and multi-task competitors. Furthermore, models trained solely on our synthetic data demonstrate strong generalization to real-world scenes without fine-tuning, producing depth maps with sharper boundaries and improved geometric fidelity. Collectively, these results underscore the significant benefits of jointly modeling OBs and depth. Code and resources are available at https://github.com/xul-ops/MoDOT.
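To make the idea of geometric consistency between OBs and depth concrete, here is a minimal sketch of one possible consistency penalty. It is not the paper's OBDCL, whose exact formulation is not given in this abstract. The function names, the threshold `tau`, and the penalty form are all assumptions chosen for illustration: the penalty grows when a large depth jump occurs where no OB is predicted, or when an OB is predicted over smooth depth.

```python
# Illustrative sketch only: NOT the paper's OBDCL.
# Assumed idea: depth discontinuities should coincide with occlusion-boundary (OB) pixels.

def depth_jump(depth, r, c):
    """Largest absolute depth difference between pixel (r, c) and its right/down neighbours."""
    h, w = len(depth), len(depth[0])
    jump = 0.0
    if c + 1 < w:
        jump = max(jump, abs(depth[r][c + 1] - depth[r][c]))
    if r + 1 < h:
        jump = max(jump, abs(depth[r + 1][c] - depth[r][c]))
    return jump


def ob_depth_consistency(depth, ob_prob, tau=0.5):
    """Hypothetical penalty, averaged over all pixels:
         (1 - p) * max(jump - tau, 0)   large depth jump where no OB is predicted
       +      p  * max(tau - jump, 0)   OB predicted where depth is smooth
    `tau` is an assumed threshold separating smooth depth from a discontinuity.
    """
    h, w = len(depth), len(depth[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            jump = depth_jump(depth, r, c)
            p = ob_prob[r][c]
            total += (1.0 - p) * max(jump - tau, 0.0) + p * max(tau - jump, 0.0)
    return total / (h * w)


# Toy 2x4 depth map with a step between columns 1 and 2.
depth = [[1, 1, 5, 5],
         [1, 1, 5, 5]]
aligned_ob = [[0, 1, 0, 0],
              [0, 1, 0, 0]]      # OB marked exactly at the depth step
misaligned_ob = [[0, 0, 1, 0],
                 [0, 0, 1, 0]]   # OB shifted one pixel off the step

aligned_penalty = ob_depth_consistency(depth, aligned_ob)
misaligned_penalty = ob_depth_consistency(depth, misaligned_ob)
```

In this toy case the aligned OB map incurs no penalty, while the shifted one is penalized both for the unexplained depth jump and for the OB placed over flat depth. A trainable version would operate on tensors with a differentiable gradient operator, but the coupling it expresses is the same.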
