GAM-Depth: Self-Supervised Indoor Depth Estimation Leveraging a Gradient-Aware Mask and Semantic Constraints

Anqi Cheng; Zhiyuan Yang; Haiyue Zhu; Kezhi Mao

TL;DR

GAM-Depth tackles indoor self-supervised depth estimation where textureless surfaces weaken photometric supervision and object boundaries cause depth inaccuracies. It introduces a gradient-aware mask that weights the photometric loss $L_{p}$ via $M_{gra}$ depending on gradient magnitude $m$, forming $L_{gra}$, and enforces semantic consistency through a shared encoder with a proxy semantic model to produce $L_{seg}$. The final objective combines $L_{gra}$, $L_{seg}$, and regularizers, yielding state-of-the-art results on NYUv2 and improved generalization to ScanNet and InteriorNet, while producing smoother depths in textureless regions and crisper depth boundaries. This approach has practical implications for indoor robotics and navigation, where reliable depth maps across varied textures are critical.

Abstract

Self-supervised depth estimation has evolved into an image reconstruction task that minimizes a photometric loss. While recent methods have made strides in indoor depth estimation, they often produce inconsistent depth estimates in textureless areas and unsatisfactory depth discrepancies at object boundaries. To address these issues, we propose GAM-Depth, built on two novel components: a gradient-aware mask and semantic constraints. The gradient-aware mask enables adaptive and robust supervision for both key areas and textureless regions by allocating weights based on gradient magnitudes. The semantic constraints reduce depth discrepancies at object boundaries in indoor self-supervised depth estimation by leveraging a co-optimization network and proxy semantic labels derived from a pretrained segmentation model. Experiments on three indoor datasets (NYUv2, ScanNet, and InteriorNet) show that GAM-Depth outperforms existing methods and achieves state-of-the-art performance, a meaningful step forward in indoor depth estimation. Our code will be available at https://github.com/AnqiCheng1234/GAM-Depth.
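
The semantic constraint and the combined objective described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the use of per-pixel cross-entropy for $L_{seg}$, the function names `proxy_segmentation_loss` and `total_loss`, and the weights `w_seg` and `w_reg` are all assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def proxy_segmentation_loss(logits, proxy_labels):
    """Mean per-pixel cross-entropy (ASSUMED form of L_seg).

    logits:       float array (H, W, C), hypothetical SegNet outputs
    proxy_labels: int array (H, W), labels from a pretrained proxy model
    """
    # Log-softmax over the class axis, shifted for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = proxy_labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    # Pick the log-probability of the proxy label at every pixel.
    picked = log_probs[rows, cols, proxy_labels]
    return float(-picked.mean())

def total_loss(l_gra, l_seg, l_reg, w_seg=0.1, w_reg=0.1):
    """Weighted sum of the loss terms; the weights are placeholders."""
    return l_gra + w_seg * l_seg + w_reg * l_reg
```

When the predicted logits agree with the proxy labels, the cross-entropy term approaches zero, so it only pulls on the shared encoder where the depth network's semantic branch disagrees with the proxy segmentation.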

Paper Structure (22 sections, 8 equations, 5 figures, 5 tables)

INTRODUCTION
RELATED WORK
Self-supervised Monocular Depth Estimation
Semantic Constraints for Depth Estimation
METHOD
Method Overview
Gradient-Aware Mask
Semantic Constraints
Final Training Loss
EXPERIMENTS
Datasets
NYUv2
ScanNet
InteriorNet
Implementation Details
...and 7 more sections

Figures (5)

Figure 1: Depth estimation comparisons. (a) Input, (b) StructDepth [li2021structdepth], (c) Ours, (d) Ground truth, (e) 1. Our proposed gradient-aware mask within the blue box region; the gray level of each pixel corresponds to its weight, with closer to white indicating higher weights. 2. The binary mask of keypoints [li2021structdepth], with textureless regions neglected. 3. Semantic constraints, with colors representing proxy semantic labels.
Figure 2: Overall training pipeline of GAM-Depth. Given a target frame $I_t$, DepthNet estimates its depth map $D_t$ and SegNet predicts its semantic labels $S_t$. DepthNet shares its encoder with SegNet. ProxySegNet predicts proxy semantic labels $S_t^{proxy}$ to supervise the training of GAM-Depth through the segmentation loss $L_{seg}$. The reference frame $I_s$ is warped to $I_t$'s view through PoseNet and ResPoseNet, proposed by MonoIndoor [ji2021monoindoor]. Our gradient-aware mask $M_{gra}$ is generated by a gradient detection method. Finally, the gradient-aware photometric loss $L_{gra}$ is calculated as the product of $M_{gra}$ and the photometric loss $L_{p}$ between $I_t$ and the warped frame $I_{s \rightarrow t}$.
Figure 3: Weight assignment in our gradient-aware mask $M_{gra}$ versus the keypoints-only photometric loss. (a) $M_{gra}$ allocates different weights to pixels based on their gradient magnitudes $m$, providing more robust supervision for both textureless regions and key areas. (b) P2Net [yu2020p] and StructDepth [li2021structdepth] allocate a weight of 1 to keypoints and 0 to non-keypoints.
Figure 4: Comparison between our gradient-aware mask $M_{gra}$ and keypoints detected by DSO [engel2017direct]. The gray level of each pixel corresponds to its weight, with closer to white indicating higher weights. Our $M_{gra}$ provides adaptive supervision in both textureless regions (indicated by red boxes) and texture-rich regions (indicated by blue boxes). Textureless regions are completely excluded from supervision in P2Net [yu2020p] and StructDepth [li2021structdepth].
Figure 5: Qualitative results on NYUv2 [silberman2012indoor]. RGB images, P2Net [yu2020p], StructDepth [li2021structdepth], our predictions, and ground truth depth maps are presented for comparison. GAM-Depth obtains more accurate results at object boundaries and in textureless regions, as indicated by the red circles.
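
The contrast drawn in Figures 3 and 4 between a gradient-aware mask and a keypoints-only binary mask can be illustrated with a small NumPy sketch. Several pieces here are assumptions made for the example, not the paper's formulation: the mapping from gradient magnitude $m$ to weight in `soft_gradient_mask` (including its `floor` parameter), the plain L1 photometric error standing in for $L_{p}$, the normalization by the mask sum, and all function names.

```python
import numpy as np

def gradient_magnitude(gray):
    """Per-pixel gradient magnitude m from finite differences."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.sqrt(gx ** 2 + gy ** 2)

def binary_keypoint_mask(m, threshold):
    """Keypoints-only weighting: 1 where the gradient is strong, else 0."""
    return (m > threshold).astype(np.float64)

def soft_gradient_mask(m, floor=0.2):
    """ASSUMED weighting (not the paper's formula): weights grow with m
    but never drop below `floor`, so flat regions keep some supervision."""
    m_norm = m / (m.max() + 1e-8)
    return floor + (1.0 - floor) * m_norm

def masked_photometric_loss(target, warped, mask):
    """Mask-weighted mean of a per-pixel L1 error (a stand-in for L_p)."""
    l_p = np.abs(target - warped)
    return float((mask * l_p).sum() / (mask.sum() + 1e-8))

# Toy frame: flat everywhere except one vertical step edge.
gray = np.zeros((8, 8))
gray[:, 4:] = 1.0

# Simulated warped frame whose only error lies in a textureless column.
warped = gray.copy()
warped[:, 0] += 0.1

m = gradient_magnitude(gray)
binary = binary_keypoint_mask(m, threshold=0.25)
soft = soft_gradient_mask(m)

loss_keypoints = masked_photometric_loss(gray, warped, binary)
loss_soft = masked_photometric_loss(gray, warped, soft)
```

In this toy case the keypoints-only loss is exactly zero because the error sits where the binary mask is zero, while the soft mask still yields a positive loss, which is the behavior the captions attribute to adaptive weighting in textureless regions.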
