Efficient Multi-task Uncertainties for Joint Semantic Segmentation and Monocular Depth Estimation

Steven Landgraf, Markus Hillemann, Theodor Kapler, Markus Ulrich

TL;DR

The paper addresses uncertainty quantification in multi-task vision, specifically joint semantic segmentation and monocular depth estimation, where existing uncertainty methods are often computationally prohibitive. It introduces EMUFormer, a two-step knowledge-distillation framework that transfers high-quality uncertainties from a Deep Ensemble teacher to an efficient SegDepthFormer student. The student is trained with a joint objective that adds a KL loss for segmentation and an RMSLE loss for depth to the task losses: $\mathcal{L} = \mathcal{L}_{CE} + w_1 \mathcal{L}_{GNLL} + w_2 \mathcal{L}_{KL} + w_3 \mathcal{L}_{RMSLE}$ with $w_1 = w_3 = 1$ and $w_2 = 10$. Empirically, EMUFormer achieves state-of-the-art results on Cityscapes and NYUv2 for both tasks, with predictive uncertainties comparable or superior to those of a Deep Ensemble at roughly an order of magnitude lower computational cost. Multi-task learning also generally improves uncertainty quality compared to solving the tasks separately. The work demonstrates the practical viability of efficient, calibrated uncertainty estimates for real-time multi-task perception in autonomous driving and related domains, and suggests GNLL-based depth uncertainty distillation as a key driver of depth performance enhancements.
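
To make the joint objective concrete, here is a minimal pure-Python sketch of how the four terms could be combined over a flat list of pixels. Only the four loss names and the weights $w_1 = w_3 = 1$, $w_2 = 10$ come from the summary above. Everything else is an assumption made for illustration: the function names, the KL direction (teacher to student), the use of the student's predicted depth variance as its uncertainty, and the dropped constant in the Gaussian negative log-likelihood. It is not the authors' implementation.

```python
import math

def cross_entropy(student_probs, labels):
    """Mean negative log-probability of the true class over pixels."""
    return -sum(math.log(p[y]) for p, y in zip(student_probs, labels)) / len(labels)

def gaussian_nll(mean, var, target):
    """Mean Gaussian NLL of the depth target (constant term dropped)."""
    terms = [0.5 * (math.log(v) + (t - m) ** 2 / v)
             for m, v, t in zip(mean, var, target)]
    return sum(terms) / len(terms)

def kl_divergence(teacher_probs, student_probs):
    """Mean KL(teacher || student) over pixels; the direction is an assumption."""
    total = 0.0
    for q, p in zip(teacher_probs, student_probs):
        total += sum(qc * math.log(qc / pc) for qc, pc in zip(q, p) if qc > 0)
    return total / len(teacher_probs)

def rmsle(student_unc, teacher_unc):
    """Root mean squared logarithmic error between two uncertainty maps."""
    sq = [(math.log1p(s) - math.log1p(t)) ** 2
          for s, t in zip(student_unc, teacher_unc)]
    return math.sqrt(sum(sq) / len(sq))

def joint_loss(student_probs, labels, depth_mean, depth_var, depth_target,
               teacher_probs, teacher_depth_unc, w1=1.0, w2=10.0, w3=1.0):
    """L = L_CE + w1 * L_GNLL + w2 * L_KL + w3 * L_RMSLE."""
    return (cross_entropy(student_probs, labels)
            + w1 * gaussian_nll(depth_mean, depth_var, depth_target)
            + w2 * kl_divergence(teacher_probs, student_probs)
            # Assumption: the student's predicted variance is compared
            # directly with the teacher's depth uncertainty.
            + w3 * rmsle(depth_var, teacher_depth_unc))
```

In this sketch, when the student already matches the teacher's class probabilities and depth uncertainty, both distillation terms vanish and the objective reduces to the two task losses, $\mathcal{L}_{CE} + w_1 \mathcal{L}_{GNLL}$.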

Abstract

Quantifying predictive uncertainty has emerged as a possible solution to common challenges like overconfidence or lack of explainability and robustness of deep neural networks, albeit one that is often computationally expensive. Many real-world applications are multi-modal in nature and hence benefit from multi-task learning. In autonomous driving, for example, the joint solution of semantic segmentation and monocular depth estimation has proven to be valuable. In this work, we first combine different uncertainty quantification methods with joint semantic segmentation and monocular depth estimation and evaluate how they perform in comparison to each other. Additionally, we reveal the benefits of multi-task learning with regard to the uncertainty quality compared to solving both tasks separately. Based on these insights, we introduce EMUFormer, a novel student-teacher distillation approach for joint semantic segmentation and monocular depth estimation as well as efficient multi-task uncertainty quantification. By implicitly leveraging the predictive uncertainties of the teacher, EMUFormer achieves new state-of-the-art results on Cityscapes and NYUv2 and additionally estimates high-quality predictive uncertainties for both tasks that are comparable or superior to those of a Deep Ensemble despite being an order of magnitude more efficient.

Paper Structure (20 sections, 13 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: A schematic overview of the SegFormer (Xie et al., 2021) architecture. The model consists of two main modules: a hierarchical Transformer-based encoder that generates high-resolution coarse features and low-resolution fine features, and a lightweight all-MLP segmentation decoder.
  • Figure 2: A schematic overview of our DepthFormer architecture. Being derived from SegFormer (Xie et al., 2021), it consists of two main modules: a hierarchical Transformer-based encoder that generates high-resolution coarse features and low-resolution fine features, and a lightweight all-MLP depth decoder.
  • Figure 3: A schematic overview of the SegDepthFormer architecture. The model combines the SegFormer (Xie et al., 2021) architecture with a lightweight all-MLP depth decoder.
  • Figure 4: A schematic overview of EMUFormer. In comparison to our proposed SegDepthFormer, EMUFormer utilizes two additional losses that distill the predictive uncertainties of the teacher into the student model.
  • Figure 5: Qualitative examples of our EMUFormer-B2 on the Cityscapes (Cordts et al., 2016) (top) and NYUv2 (Silberman et al., 2012) (bottom) datasets. Red rectangles are added to highlight interesting areas. Best viewed in color.