Examining Common Paradigms in Multi-Task Learning

Cathrin Elich, Lukas Kirchdorfer, Jan M. Köhler, Lukas Schott

TL;DR

This paper investigates why multi-task learning (MTL) methods often underperform single-task learning (STL) baselines by examining two key paradigms: optimizer choice and gradient interactions. It demonstrates that the Adam optimizer frequently provides a stronger baseline in MTL and derives theoretical invariances that relate loss scaling to optimization dynamics, connecting both uncertainty weighting (UW) and Adam to how losses are weighted and updated. The study further shows that differences in gradient magnitude across tasks and samples, rather than angular misalignment, are what chiefly distinguish MTL from STL, challenging the view that gradient alignment is the sole reason MTL fails to beat STL. Across standard computer vision datasets (Cityscapes, NYUv2, CelebA), Adam-based configurations reach the Pareto front more often than SGD-based ones, prompting a reconsideration of claims made for multi-task optimization (MTO) methods and highlighting the value of cross-pollination between STL and MTL techniques. Overall, the findings advocate optimizer-aware, cross-paradigm approaches and deeper exploration of capacity allocation to improve multi-task performance.

Abstract

While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single-task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we investigate paradigms in MTL in the context of STL: First, the impact of the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL empirically in various experiments. To further investigate Adam's effectiveness, we theoretically derive a partial loss-scale invariance under mild assumptions. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment, we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Overall, we find surprising similarities between STL and MTL, suggesting that methods from both fields should be considered in a broader context.
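
The partial loss-scale invariance mentioned in the abstract can be illustrated with a small sketch. It is not the authors' experiment: it assumes two scalar task heads on a single frozen feature, squared-error losses, plain SGD without momentum as the contrast case, and hypothetical values for the feature, targets, and hyperparameters. Because each head receives gradients only from its own loss, scaling that loss scales the head gradient by a constant, and Adam's normalization by the root of the second-moment estimate cancels that constant up to the small `eps` term.

```python
import math


def run(optimizer, scales, steps=50, lr=0.01, eps=1e-8):
    """Train two scalar task heads on one frozen feature; return final weights.

    All values (feature, targets, hyperparameters) are illustrative assumptions.
    """
    x = 1.5                  # output of the frozen backbone
    targets = [2.0, -1.0]    # one regression target per task
    w = [0.0, 0.0]           # one head parameter per task
    m = [0.0, 0.0]           # Adam first-moment estimates
    v = [0.0, 0.0]           # Adam second-moment estimates
    b1, b2 = 0.9, 0.999
    for t in range(1, steps + 1):
        for i in range(2):
            # Gradient of scales[i] * (w_i * x - y_i)^2 with respect to w_i.
            g = scales[i] * 2.0 * (w[i] * x - targets[i]) * x
            if optimizer == "sgd":
                w[i] -= lr * g
            else:  # Adam with bias correction
                m[i] = b1 * m[i] + (1.0 - b1) * g
                v[i] = b2 * v[i] + (1.0 - b2) * g * g
                m_hat = m[i] / (1.0 - b1 ** t)
                v_hat = v[i] / (1.0 - b2 ** t)
                w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


equal = (1.0, 1.0)      # both losses weighted equally
skewed = (10.0, 0.1)    # first loss scaled up, second scaled down

adam_equal, adam_skewed = run("adam", equal), run("adam", skewed)
sgd_equal, sgd_skewed = run("sgd", equal), run("sgd", skewed)

print("Adam:", adam_equal, adam_skewed)
print("SGD: ", sgd_equal, sgd_skewed)
```

In this sketch the two Adam runs end at nearly the same head weights, while the two SGD runs diverge. The invariance is only partial: shared backbone parameters receive the sum of the task gradients, so rescaling individual losses changes the direction of that sum, and the `eps` term breaks exactness for very small gradients.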

Paper Structure

This paper contains 34 sections, 26 equations, 12 figures, and 12 tables.

Figures (12)

  • Figure 1: Toy task experiment from CAGrad [liu_cagrad_2021] for different learning rates and optimizers. Consistent with results from [xin_mto-even-help_2022], we observe that the choice of the learning rate is crucial even for this toy optimization problem. Moreover, it becomes apparent that selecting Adam over simple gradient descent (GD) yields superior results. The contour lines depict the 2D loss landscape; the optimization trajectories are colored from red to yellow for 100k iteration steps from three different starting points (seeds).
  • Figure 2: Parallel coordinate plot over all experiments on Cityscapes. We distinguish between experiments using the SGD+mom and Adam optimizers. Experiments that reached Pareto front performance are drawn with higher saturation. We observe that Adam clearly outperforms SGD+mom.
  • Figure 3: High intra-task diversity can mimic MTL.
  • Figure 4: Gradient similarities and conflicts for different datasets and network architectures over training epochs. For each dataset and network combination, we report (from left to right) gradient cosine similarity, gradient magnitude similarity, and the ratio of conflicting gradient parameters w.r.t. gradient pairs corresponding to either inter-samples (fixed task) or inter-tasks (fixed sample). We report mean (solid line), standard deviation (shaded area), upper ($97.5\%$) and lower ($2.5\%$) percentile (dotted line) within an epoch. Overall, the direction conflicts are similar (first / last column), whereas the magnitude differences are more pronounced in MTL (middle column).
  • Figure A1: Invariances within the neural network for a frozen backbone. We compare the effect of loss scalings in a toy experiment with two tasks. For each optimizer and loss-weighting combination, we run two settings: a) losses L1 and L2 are equally weighted, or b) L1 is scaled by 10x and L2 by 0.1x. For each setting, we evaluate SGD + momentum and Adam with equal weighting (EW), i.e., no additional loss weighting, and SGD + momentum with optimal uncertainty weighting (UW-O). We show the scaled losses, gradient magnitudes, and gradient update magnitudes in the two task heads and keep the backbone frozen. While SGD does not offer any loss-scaling invariance, Adam makes the gradient updates of the head parameters invariant to the loss scales, confirming our derivation (red lines overlap in lowest row). Likewise, for UW-O we observe the theoretically derived invariances (green lines overlap in lowest row).
  • ...and 7 more figures
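
The three quantities reported in Figure 4 can be sketched for a single pair of gradients as follows. The definitions are assumptions, since the captions do not spell them out: cosine similarity is the usual normalized dot product; magnitude similarity is taken as 2·‖g1‖·‖g2‖ / (‖g1‖² + ‖g2‖²), a form commonly used in the gradient-conflict literature; and the ratio of conflicting parameters is taken as the fraction of coordinates whose two gradient entries have opposite signs. The function names are hypothetical, and gradients are assumed to be flattened into plain lists of floats.

```python
import math


def _norm(g):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(a * a for a in g))


def cosine_similarity(g1, g2):
    """Angular alignment in [-1, 1]; negative values mean opposing directions."""
    dot = sum(a * b for a, b in zip(g1, g2))
    return dot / (_norm(g1) * _norm(g2))


def magnitude_similarity(g1, g2):
    """Equals 1 when both norms match and tends to 0 as they diverge (assumed definition)."""
    n1, n2 = _norm(g1), _norm(g2)
    return 2.0 * n1 * n2 / (n1 * n1 + n2 * n2)


def conflicting_ratio(g1, g2):
    """Fraction of parameters whose gradient entries have opposite signs (assumed definition)."""
    conflicts = sum(1 for a, b in zip(g1, g2) if a * b < 0)
    return conflicts / len(g1)


# Same direction, different magnitude: aligned in angle, dissimilar in norm.
g_a, g_b = [3.0, 4.0], [6.0, 8.0]
# Equal magnitude, orthogonal direction, one of two coordinates in conflict.
g_c, g_d = [1.0, -1.0], [1.0, 1.0]

for pair in ((g_a, g_b), (g_c, g_d)):
    print(cosine_similarity(*pair), magnitude_similarity(*pair), conflicting_ratio(*pair))
```

To mirror the figure's two settings, the pair would be either the gradients of two samples for a fixed task (inter-sample) or the gradients of two tasks for a fixed sample (inter-task), with the statistics then aggregated over an epoch.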