Analytical Uncertainty-Based Loss Weighting in Multi-Task Learning
Lukas Kirchdorfer, Cathrin Elich, Simon Kutsche, Heiner Stuckenschmidt, Lukas Schott, Jan M. Köhler
TL;DR
The paper addresses the balancing of task losses in multi-task learning (MTL) by deriving Soft Optimal Uncertainty Weighting (UW-SO). UW-SO takes the analytically optimal inverse-loss weights of Uncertainty Weighting (UW-O) and passes them through a softmax with temperature $T$ to produce normalized task weights. It aims to match the performance of the combinatorially expensive Scalarization method while remaining computationally efficient, and it shows strong, consistent improvements across NYUv2, Cityscapes, and CelebA with multiple architectures. Key findings are that untuned loss weighting offers substantial gains on smaller models, whereas larger networks diminish these gains; that learning-rate tuning is critical, whereas weight decay has a relatively modest effect; and that UW-SO mitigates the inertia and overfitting tendencies of Uncertainty Weighting (UW). The work also provides a comprehensive benchmark, including ablations on temperature, oscillations, and cross-dataset performance, with guidance on hyperparameter strategies and future directions for automating temperature selection. Overall, the method is a theoretically grounded, easily tunable approach to loss balancing that improves MTL performance without the heavy computational cost of exhaustive weight searches.
Abstract
With the rise of neural networks in various domains, multi-task learning (MTL) has gained significant relevance. A key challenge in MTL is balancing individual task losses during neural network training to improve performance and efficiency through knowledge sharing across tasks. To address this challenge, we propose a novel task-weighting method that builds on the most prevalent approach, Uncertainty Weighting, and computes analytically optimal uncertainty-based weights, normalized by a softmax function with tunable temperature. Our approach yields results comparable to the combinatorially prohibitive, brute-force approach of Scalarization while being a more cost-effective alternative. We conduct an extensive benchmark on various datasets and architectures, in which our method consistently outperforms six other common weighting methods. Furthermore, we report noteworthy experimental findings for the practical application of MTL. For example, larger networks diminish the influence of weighting methods, and tuning the weight decay has a low impact compared to tuning the learning rate.
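The weighting described above can be illustrated with a minimal sketch. It assumes that each task weight is obtained by applying a temperature-scaled softmax to the inverse of the current task loss, that all losses are strictly positive, and that the weights are treated as constants when the weighted loss is differentiated. The function names `uw_so_weights` and `weighted_total_loss` are hypothetical and chosen only for illustration; this is not the authors' reference implementation.

```python
import math


def uw_so_weights(losses, temperature=1.0):
    """Return normalized task weights from positive task losses.

    Assumption: weights are a softmax over inverse losses divided by
    the temperature, so tasks with smaller losses get larger weights.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if any(loss <= 0 for loss in losses):
        raise ValueError("all losses must be strictly positive")

    # Inverse-loss scores, scaled by the temperature.
    logits = [(1.0 / loss) / temperature for loss in losses]

    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_total_loss(losses, temperature=1.0):
    """Combine task losses into one scalar using the weights above."""
    weights = uw_so_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


# Three hypothetical task losses of different magnitudes.
example_losses = [0.5, 2.0, 4.0]
weights = uw_so_weights(example_losses, temperature=1.0)
nearly_uniform = uw_so_weights(example_losses, temperature=1e6)
```

In this sketch, a small temperature concentrates weight on the task with the smallest loss, while a large temperature pushes the weights toward uniform, which is why the temperature acts as the single tunable hyperparameter. In an automatic-differentiation framework, the losses used to compute the weights would typically be detached from the computation graph; that detail is an assumption here and is not shown.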
