Depth Edge Alignment Loss: DEALing with Depth in Weakly Supervised Semantic Segmentation

Patrick Schmidt, Vasileios Belagiannis, Lazaros Nalpantidis

TL;DR

Depth Edge Alignment Loss (DEAL) reduces the need for pixel-level labels in weakly supervised semantic segmentation (WSSS) by using depth information to align class activation map (CAM) boundaries with depth edges. Sobel filtering yields edge maps $a$ (from the CAMs) and $d$ (from depth), which are passed through a shifted $\tanh$ activation: $a' = \tanh\left(\mu + \log\frac{a}{1-a}\right)$ and $d' = \tanh\left(\mu + \log\frac{d}{1-d}\right)$, with $\mu = 2.5$. The loss averages the product of the two activations over all $H \times W$ pixels and over the classes selected by the image-level class labels $y_k$: $\mathcal{L}_{\mathrm{edge}} = -\frac{1}{HW}\sum_{ij}\frac{1}{\sum_k y_k}\sum_k y_k a'_{k,ij} d'_{ij}$. Adding DEAL to CAM-based WSSS models (e.g., WeakTr and SEAM), optionally together with ISL/FSL, yields consistent mIoU improvements on VOC, COCO, and HOPE and remains robust to depth noise. The framework is model-agnostic and accommodates noisy real-world depth data, which makes it practical for robotic perception and suggests integrating depth with vision-language models as future work.
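The loss above can be sketched in a few lines of NumPy. This is only an illustrative reading of the formula, not the authors' implementation: the function names (`filter3x3`, `sobel_magnitude`, `edge_activation`, `deal_loss`) are hypothetical, and the way edge magnitudes are scaled into $(0, 1)$ before the logit (division by the maximum, then clipping) is an assumption, since the summary does not state how $a$ and $d$ are normalised.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(img, kernel):
    """Cross-correlate a 2-D array with a 3x3 kernel using edge padding."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out


def sobel_magnitude(img, eps=1e-6):
    """Sobel gradient magnitude, scaled into (0, 1).

    Assumption: max-normalisation plus clipping, so the logit stays finite.
    """
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    mag = mag / (mag.max() + eps)
    return np.clip(mag, eps, 1.0 - eps)


def edge_activation(e, mu=2.5):
    """Shifted tanh of the logit: tanh(mu + log(e / (1 - e)))."""
    return np.tanh(mu + np.log(e / (1.0 - e)))


def deal_loss(cams, depth, labels, mu=2.5):
    """Edge alignment loss for one image.

    cams:   (K, H, W) class activation maps
    depth:  (H, W) depth map
    labels: (K,) binary image-level labels y_k
    """
    a = np.stack([edge_activation(sobel_magnitude(c), mu) for c in cams])
    d = edge_activation(sobel_magnitude(depth), mu)
    alignment = a * d[None, :, :]                      # (K, H, W)
    n_present = max(float(labels.sum()), 1.0)          # guard against no labels
    per_pixel = (labels[:, None, None] * alignment).sum(axis=0) / n_present
    return -per_pixel.mean()                           # mean over H*W pixels
```

On a toy input where a square object produces a step in the depth map, a CAM whose square coincides with that step receives a lower loss than a CAM whose square is shifted away from it, which is the behaviour the loss is meant to encourage.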

Abstract

Autonomous robotic systems applied to new domains require an abundance of expensive, pixel-level dense labels to train robust semantic segmentation models under full supervision. This study proposes a model-agnostic Depth Edge Alignment Loss to improve Weakly Supervised Semantic Segmentation models across different datasets. The methodology generates pixel-level semantic labels from image-level supervision, avoiding expensive annotation processes. While weak supervision is widely explored in traditional computer vision, our approach adds supervision with pixel-level depth information, a modality commonly available in robotic systems. We demonstrate that our approach improves segmentation performance across datasets and models, and that it can be combined with other losses for further gains, with improvements of up to +5.439, +1.274, and +16.416 points in mean Intersection over Union on the PASCAL VOC validation set, the MS COCO validation set, and the HOPE static onboarding split, respectively. Our code is made publicly available at https://github.com/DTU-PAS/DEAL.

Paper Structure

This paper contains 17 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: A graphical summary of the intuition behind our proposed Depth Edge Alignment Loss. Alignment of object boundaries in RGB images and edges extracted at depth discontinuities can improve the results of WSSS.
  • Figure 2: Detailed graphical overview of our method. The upper strand shows the depth information pipeline, and the lower strand shows the CAM generation pipeline. From both CAM and depth, we extract edges $a$ and $d$, use a $\tanh$ activation function $\eta$, calculate the alignment map through channel-wise per-element multiplication and then aggregate those into $\mathcal{L}_{\mathrm{deal}}$. Note that the only trainable module is $f_\theta(x)$, which can be any trainable CAM-based WSSS framework.
  • Figure 3: Qualitative results for HOPE, shown on three consecutive frames with a gap of 10 frames in between, from left to right. The top row shows the input RGB images, overlaid with the thresholded CAMs in yellow and magenta for baseline and DEAL respectively. The target object is the bottle being manipulated by the hands. The middle and bottom rows show the CAMs for baseline and DEAL, respectively.
  • Figure 4: Qualitative results of WeakTr trained with the different variants presented in Table \ref{tab:baseline_results_maxmIoU}. Note that the predicted masks are obtained by CAM thresholding without any post-processing.
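
The captions for Figures 3 and 4 refer to masks obtained by thresholding CAMs without post-processing. A minimal sketch of that step is given below, under stated assumptions: the function name `cams_to_mask`, the threshold of 0.5, and the convention that class $k$ maps to label id $k + 1$ with 0 as background are all hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np


def cams_to_mask(cams, labels, threshold=0.5, background_id=0):
    """Turn CAMs into a label mask by plain thresholding.

    cams:   (K, H, W) activation maps, assumed to lie in [0, 1]
    labels: (K,) binary image-level labels; absent classes are suppressed
    Returns an (H, W) integer mask where class k is written as k + 1.
    """
    cams = np.asarray(cams, dtype=np.float64)
    cams = cams * np.asarray(labels)[:, None, None]   # zero out absent classes
    best_class = cams.argmax(axis=0)                  # strongest class per pixel
    best_score = cams.max(axis=0)
    # Pixels whose strongest activation is below the threshold become background.
    return np.where(best_score >= threshold, best_class + 1, background_id)
```

Because no refinement such as CRF post-processing is applied in this sketch, the quality of the mask depends directly on how well the CAM boundaries follow the object.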