Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning

Lintao Xu, Yinghao Wang, Chaohui Wang

TL;DR

This work introduces MoDOT, a two-stage, multi-task framework that jointly estimates occlusion boundaries (OBs) and monocular depth by enforcing their geometric coupling. Central to MoDOT are the Cross-Attention Strip Module (CASM), which performs cross-modal feature fusion, and the OB-Depth Constraint Loss (OBDCL), which aligns depth discontinuities with OBs. The model is trained on the synthetic OB-Hypersim dataset and evaluated on OB-FUTURE, NYUD-v2, and zero-shot real-world scenes. It achieves state-of-the-art depth and OB accuracy across synthetic and real datasets and generalizes well without domain adaptation. The work highlights the mutual benefits of modeling OBs and depth together through its architecture and loss design, while noting limitations due to the synthetic-to-real domain gap and suggesting directions for improving cross-domain OB generalization.

Abstract

Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects. This task is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as Occlusion Boundaries (OBs) provide critical geometric cues for resolving depth ambiguities, while depth can conversely refine occlusion reasoning. In this paper, we aim to systematically model and exploit this mutually beneficial relationship. To this end, we propose MoDOT, a novel framework for joint estimation of depth and OBs, which incorporates a new Cross-Attention Strip Module (CASM) to leverage mid-level OB features for depth prediction, and a novel OB-Depth Constraint Loss (OBDCL) to enforce geometric consistency. To facilitate this study, we contribute OB-Hypersim, a large-scale photorealistic dataset with precise depth and self-occlusion-handled OB annotations. Extensive experiments on two synthetic datasets and NYUD-v2 demonstrate that MoDOT achieves significantly better performance than single-task baselines and multi-task competitors. Furthermore, models trained solely on our synthetic data demonstrate strong generalization to real-world scenes without fine-tuning, producing depth maps with sharper boundaries and improved geometric fidelity. Collectively, these results underscore the significant benefits of jointly modeling OBs and depth. Code and resources are available at https://github.com/xul-ops/MoDOT.
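To make the idea of geometric consistency between OBs and depth concrete, here is a minimal sketch of a consistency penalty. It is not the paper's OBDCL, whose exact form is not given on this page. The function names, the forward-difference depth jump, and the `margin` parameter are all assumptions chosen for illustration: the penalty grows when an OB pixel shows no depth step, or when a depth step appears where no OB is marked.

```python
def depth_jump(depth, y, x):
    """Largest absolute depth difference to the right and bottom neighbours."""
    h, w = len(depth), len(depth[0])
    jumps = [0.0]
    if x + 1 < w:
        jumps.append(abs(depth[y][x + 1] - depth[y][x]))
    if y + 1 < h:
        jumps.append(abs(depth[y + 1][x] - depth[y][x]))
    return max(jumps)


def ob_depth_consistency(depth, ob_mask, margin=0.5):
    """Illustrative penalty (hypothetical, not the published OBDCL).

    depth:   2D list of depth values.
    ob_mask: 2D list of 0/1 flags, 1 marking an occlusion-boundary pixel.
    margin:  assumed threshold separating a "smooth" change from a "step".
    """
    h, w = len(depth), len(depth[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            jump = depth_jump(depth, y, x)
            if ob_mask[y][x]:
                # Boundary pixel: penalise a missing depth discontinuity.
                total += max(0.0, margin - jump)
            else:
                # Non-boundary pixel: penalise an unexplained depth step.
                total += max(0.0, jump - margin)
    return total / (h * w)


# A depth map with a vertical step between columns 1 and 2.
depth = [[1.0, 1.0, 3.0, 3.0],
         [1.0, 1.0, 3.0, 3.0]]
aligned = [[0, 1, 0, 0], [0, 1, 0, 0]]      # OB placed on the step
misaligned = [[0, 0, 1, 0], [0, 0, 1, 0]]   # OB shifted by one pixel

print(ob_depth_consistency(depth, aligned))
print(ob_depth_consistency(depth, misaligned))
```

With the boundary placed on the depth step the penalty vanishes, while shifting it by one pixel is penalised twice: once for the boundary without a step and once for the step without a boundary.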

Paper Structure

This paper contains 10 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of zero-shot results on a real-world example from iBims-1 [koch2018ibims1]. Significant self-occlusion boundaries are highlighted in red. The proposed multi-task framework, MoDOT, is jointly trained to predict both depth and OBs. Compared to the single-task depth baseline [yuan2022newcrfs], which is trained solely with depth supervision, MoDOT produces more accurate and structurally consistent depth maps on unseen scenes, particularly around OBs, while adding only 11M parameters (4% of the depth baseline). Compared to the single-task OB baseline, MoDOT tends to predict more OBs (resulting in higher recall) but introduces additional noise, leading to a slight drop in F-score. All methods are trained exclusively on our proposed synthetic dataset.
  • Figure 2: Overall framework of MoDOT. On the left is the main model structure of MoDOT used in the first training stage; on the right are the detailed structures of CASM, EIP, and Second Stage Refinement (SSR). Our method uses a shared encoder that extracts hierarchical image features, which are then separately fed into the depth decoder and the OB decoder. The designed CASM facilitates communication between the encoder and the two decoders, enhancing feature interaction for both depth and OB predictions. A PPM aggregates global high-level context for the first-stage depth decoder, while the proposed EIP captures enhanced local low-level image features for the final OB decoder and the subsequent SSR. SSR is incorporated to further refine depth and OB features on full-resolution images.
  • Figure 3: Qualitative comparisons on the synthetic OB-Hypersim and the real NYUD-v2 datasets. Our MoDOT produces more accurate depth maps than the single-task depth baseline and achieves higher-recall OB predictions compared to the single-task OB baseline.
  • Figure 4: Zero-shot comparison on real scenes from the iBims-1 [koch2018ibims1] and DIODE [vasiljevic2019diode] datasets, where the single-task baselines and our MoDOT are trained exclusively on the proposed synthetic OB-Hypersim dataset.
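The Figure 2 caption describes CASM as a module that lets the depth and OB branches exchange features. As a rough illustration of the underlying mechanism only, the sketch below implements generic single-head scaled dot-product cross-attention with a residual connection, where features from one branch (queries) attend to features from the other (keys and values). The strip-wise layout of CASM is not described on this page and is not modelled here; the function names and the choice of which branch supplies the queries are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic cross-attention (hypothetical stand-in, not the actual CASM).

    queries: list of d-dim vectors, e.g. depth-branch features.
    keys, values: lists of d-dim vectors, e.g. OB-branch features.
    Returns one fused d-dim vector per query.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Attention-weighted average of the value vectors.
        attended = [sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(d)]
        # Residual connection keeps the original query features.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


depth_feats = [[1.0, 0.0]]
ob_feats = [[1.0, 0.0], [1.0, 0.0]]
ob_values = [[2.0, 0.0], [4.0, 0.0]]
print(cross_attention(depth_feats, ob_feats, ob_values))
```

In the toy call the two keys are identical, so the query weights both values equally and the fused feature is the query plus their average.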