Analytical Uncertainty-Based Loss Weighting in Multi-Task Learning

Lukas Kirchdorfer; Cathrin Elich; Simon Kutsche; Heiner Stuckenschmidt; Lukas Schott; Jan M. Köhler

TL;DR

The paper tackles the problem of balancing task losses in multi-task learning by deriving Soft Optimal Uncertainty Weighting (UW-SO), which uses an analytically optimal inverse-loss weighting (UW-O) tempered by a softmax with temperature $T$ to produce normalized task weights. UW-SO aims to match the performance of the combinatorially expensive Scalarization method while remaining computationally efficient, and it demonstrates strong, consistent improvements across NYUv2, Cityscapes, and CelebA with multiple architectures. Key findings show that untuned loss weighting offers substantial gains on smaller models but that larger networks diminish these gains, while learning-rate tuning is critical and weight-decay effects are relatively modest; UW-SO also mitigates UW’s inertia and overfitting tendencies. The work provides a practical, scalable weighting scheme for practitioners and outlines a comprehensive benchmark, including ablations on temperature, oscillations, and cross-dataset performance, with guidance on hyperparameter strategies and future directions for automating temperature selection. The proposed method thereby contributes a theoretically grounded, easily tunable approach to loss balancing that improves MTL performance without incurring the heavy computational cost of exhaustive weight searches.

Abstract

With the rise of neural networks in various domains, multi-task learning (MTL) has gained significant relevance. A key challenge in MTL is balancing individual task losses during neural network training to improve performance and efficiency through knowledge sharing across tasks. To address this challenge, we propose a novel task-weighting method that builds on the most prevalent approach, Uncertainty Weighting, and computes analytically optimal uncertainty-based weights, normalized by a softmax function with tunable temperature. Our approach yields results comparable to the combinatorially prohibitive, brute-force approach of Scalarization while offering a more cost-effective yet high-performing alternative. We conduct an extensive benchmark on various datasets and architectures. Our method consistently outperforms six other common weighting methods. Furthermore, we report noteworthy experimental findings for the practical application of MTL. For example, larger networks diminish the influence of weighting methods, and tuning the weight decay has a low impact compared to tuning the learning rate.
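The weighting idea described above, inverse-loss weights passed through a softmax with temperature $T$, can be illustrated with a minimal sketch. The exact form used here, $\omega_t \propto \exp\big((1/L_t)/T\big)$, is an assumption based on this summary, not a verified copy of the paper's equation. The function names `uw_so_weights` and `weighted_total_loss` are hypothetical. In a real training loop the losses inside the weights would typically be treated as constants (no gradient through the weights), which this plain-Python version does not model.

```python
import math


def uw_so_weights(losses, temperature):
    """Softmax over inverse task losses with a temperature (assumed form).

    losses: positive per-task loss values for the current batch.
    temperature: positive scalar T; larger T gives more uniform weights.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if any(loss <= 0 for loss in losses):
        raise ValueError("losses must be positive")
    # Inverse-loss logits scaled by 1/T.
    logits = [1.0 / (loss * temperature) for loss in losses]
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_total_loss(losses, temperature):
    """Combine task losses with the normalized weights."""
    weights = uw_so_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


# Example: three tasks with different loss magnitudes.
example_losses = [0.5, 2.0, 1.0]
sharp = uw_so_weights(example_losses, temperature=1.0)
flat = uw_so_weights(example_losses, temperature=1000.0)
```

Under this assumed form, the task with the smallest loss receives the largest weight, and raising `temperature` pushes the weights toward a uniform distribution, so $T$ is the single hyperparameter left to tune.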

Paper Structure (37 sections, 22 equations, 20 figures, 6 tables)

Introduction
Related Work
Background and Method
Weaknesses of Uncertainty Weighting and Scalarization
Our contribution: Soft Optimal Uncertainty Weighting
UW-O: Minimizing the total loss in UW
UW-SO: Soft Optimal Uncertainty Weighting
Experiments and Results
Experimental setup
Common loss weighting methods benchmark
Ablation Studies
Discussion and Conclusion
Derivation of UW-SO
L1 loss
L2 loss
...and 22 more sections

Figures (20)

Figure 1: Comparison of the learning procedure of task weights for a) semantic segmentation, b) depth estimation, and c) surface normals on NYUv2 using SegNet for two different initializations of $\sigma_t$ for UW. Equal starting parameter values in blue (UW S1) as in LibMTL; higher starting values (values of last epoch from a previous run) in orange (UW S2). The plots do not show $\sigma_t$ values, but actual task weights $\omega_t = \frac{1}{2\sigma_t^2}$. We plot the mean task weight of 5 random seeds with the standard deviation as shaded area.
Figure 2: Comparison of weight ratio and loss development of UW and UW-SO for the Bald task of CelebA. While UW shows superior training performance caused by putting a high weight on the task, it fails to generalize to unseen data (increasing test loss). UW-SO puts less weight on the task and alleviates the overfitting.
Figure 3: Performance of UW-SO for different choices of $T$ on the validation data. a) shows a clear, reasonably flat minimum for Cityscapes that eases the optimization of $T$. b) shows the $\Delta_m$ development for different $T$ values for NYUv2, indicating the optimal configuration already after around 100 epochs.
Figure 4: Boxplots over the std. dev. of the weighted NYUv2 (SegNet) task losses $\omega_k L_{k}$ of all batches from one epoch. Std. dev. over one epoch is one observation.
Figure A1: $\Delta_m$ scores on the test data for different choices of the learning rate with a fixed weight decay (for (a) and (b): $\lambda = 10^{-5}$; for (c): $\lambda = 10^{-4}$) according to our chosen line search approach, averaged over 5 runs. We show results for a) NYUv2 with SegNet, b) Cityscapes with SegNet, and c) CelebA with ResNet-18. In particular for (b) and (c) the optimal learning rate value highly varies across different weighting approaches, underlining the necessity to perform method-specific learning rate tuning.
...and 15 more figures
