Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

Alex Kendall, Yarin Gal, Roberto Cipolla

TL;DR

The paper addresses the challenge of balancing losses in multi-task learning for scene understanding by introducing a probabilistic weighting scheme based on homoscedastic (task-dependent) uncertainty. It derives a joint likelihood for regression and classification tasks, enabling the model to learn per-task weights via learnable noise parameters and a stable log-variance representation. The approach is implemented in a unified encoder-decoder architecture that performs semantic, instance, and depth predictions from monocular images, achieving superior performance over single-task models and naïve loss weighting. The findings highlight dynamic, data-driven task weighting during training and demonstrate practical benefits for real-time, multi-task scene understanding systems.
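
The weighting scheme summarized above can be illustrated with a small sketch. It assumes the commonly cited form of the objective, in which each task loss is scaled by a learned precision $\exp(-s_i)$, with $s_i = \log \sigma_i^2$, and a $\tfrac{1}{2} s_i$ term penalizes large uncertainty. The function name `uncertainty_weighted_loss` and its arguments are hypothetical, chosen for illustration. In real training the log-variances would be learnable parameters updated by the optimizer; here they are plain floats.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars, is_regression):
    """Combine per-task losses using homoscedastic uncertainty weights.

    task_losses   : list of float, the raw loss value of each task
    log_vars      : list of float, s_i = log(sigma_i^2) for each task
                    (hypothetical stand-ins for learnable parameters)
    is_regression : list of bool, True for regression tasks (factor 1/2),
                    False for classification tasks (factor 1)
    """
    total = 0.0
    for loss, s, regression in zip(task_losses, log_vars, is_regression):
        precision = math.exp(-s)            # 1 / sigma^2, always positive
        scale = 0.5 if regression else 1.0  # 1/(2 sigma^2) vs. 1/sigma^2
        # 0.5 * s equals log(sigma); it stops sigma from growing unboundedly.
        total += scale * precision * loss + 0.5 * s
    return total


# A depth (regression) loss and a segmentation (classification) loss.
combined = uncertainty_weighted_loss(
    task_losses=[2.0, 1.0],
    log_vars=[0.0, 0.0],
    is_regression=[True, False],
)
```

Parameterizing by the log-variance rather than by $\sigma$ directly avoids division by zero and leaves the parameter unconstrained, which is the stability benefit the summary refers to. Under this form, raising $s_i$ shrinks the weight on task $i$ at the cost of a larger penalty term, so the balance between tasks is set by the data rather than by hand.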

Abstract

Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.

Paper Structure

This paper contains 14 sections, 10 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Multi-task deep learning. We derive a principled way of combining multiple regression and classification loss functions for multi-task learning. Our architecture takes a single monocular RGB image as input and produces a pixel-wise classification, an instance semantic segmentation and an estimate of per pixel depth. Multi-task learning can improve accuracy over separately trained models because cues from one task, such as depth, are used to regularize and improve the generalization of another domain, such as segmentation.
  • Figure 2: Learning multiple tasks improves the model's representation and individual task performance. These figures and tables illustrate the advantages of multi-task learning for (a) semantic classification and depth regression and (b) instance and depth regression. Performance of the model in individual tasks is seen at both edges of the plot where $w=0$ and $w=1$. For some balance of weightings between each task, we observe improved performance for both tasks. All models were trained with a learning rate of $0.01$ with the respective weightings applied to the losses using the loss function in (\ref{eqn:basic_loss}). Results are shown using the Tiny CityScapes validation dataset using a down-sampled resolution of $128\times256$.
  • Figure 3: Instance centroid regression method. For each pixel, we regress a vector pointing to the instance's centroid. The loss is only computed over pixels which are from instances. We visualise (c) by representing colour as the orientation of the instance vector, and intensity as the magnitude of the vector.
  • Figure 4: This example shows two cars which are occluded by trees and lampposts, making the instance segmentation challenging. Our instance segmentation method can handle occlusions effectively. We can correctly handle segmentation masks which are split by occlusion, yet part of the same instance, by incorporating semantics and geometry.
  • Figure 5: Qualitative results for multi-task learning of geometry and semantics for road scene understanding. Results are shown on test images from the CityScapes dataset using our multi-task approach with a single network trained on all tasks. We observe that multi-task learning improves the smoothness and accuracy for depth perception because it learns a representation that uses cues from other tasks, such as segmentation (and vice versa).
  • ...and 4 more figures
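
The instance centroid regression described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `centroid_targets` and `masked_l1`, the use of `0` as the background label, and the choice of an L1 penalty are all assumptions made for the example. It builds, for every instance pixel, the vector pointing to that instance's centroid, and computes the loss only over instance pixels.

```python
def centroid_targets(instance_ids, background=0):
    """Return per-pixel (dy, dx) vectors to the instance centroid and a mask.

    instance_ids : 2D list of ints; `background` marks non-instance pixels
                   (the background label value is an assumption).
    """
    # First pass: accumulate coordinate sums and pixel counts per instance.
    sums = {}
    for y, row in enumerate(instance_ids):
        for x, inst in enumerate(row):
            if inst == background:
                continue
            sy, sx, n = sums.get(inst, (0.0, 0.0, 0))
            sums[inst] = (sy + y, sx + x, n + 1)
    centroids = {k: (sy / n, sx / n) for k, (sy, sx, n) in sums.items()}

    # Second pass: vector from each instance pixel to its centroid.
    targets, mask = [], []
    for y, row in enumerate(instance_ids):
        t_row, m_row = [], []
        for x, inst in enumerate(row):
            if inst == background:
                t_row.append((0.0, 0.0))
                m_row.append(False)
            else:
                cy, cx = centroids[inst]
                t_row.append((cy - y, cx - x))
                m_row.append(True)
        targets.append(t_row)
        mask.append(m_row)
    return targets, mask


def masked_l1(pred, targets, mask):
    """Mean L1 error over instance pixels only (L1 is an assumed choice)."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, targets, mask):
        for (py, px), (ty, tx), m in zip(p_row, t_row, m_row):
            if m:
                total += abs(py - ty) + abs(px - tx)
                count += 1
    return total / count if count else 0.0


# Two small instances (ids 1 and 2) on a 2x3 grid; 0 is background.
ids = [[1, 1, 0],
       [0, 2, 2]]
targets, mask = centroid_targets(ids)
zero_pred = [[(0.0, 0.0)] * 3 for _ in range(2)]
loss = masked_l1(zero_pred, targets, mask)
```

Masking out background pixels mirrors the caption's statement that the loss is only computed over pixels belonging to instances; background pixels have no centroid to point to.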