Can Optimization Trajectories Explain Multi-Task Transfer?

David Mueller; Mark Dredze; Nicholas Andrews

TL;DR

This work investigates why multi-task learning (MTL) yields mixed generalization by analyzing how MTL affects task optimization and whether optimization trajectories can explain transfer. Through extensive empirical analysis across multiple MT settings, it shows that transfer (positive or negative) appears early in training as a generalization gap at comparable training losses and persists thereafter. It evaluates trajectory-level factors such as sharpness, Fisher information, and gradient coherence, and finds they do not consistently explain transfer; similarly, specialized multi-task optimizers (SMTOs) fail to reliably improve MT transfer, despite affecting optimization. The results challenge the idea that general-purpose optimization strategies can universally address MT transfer, suggesting a shift toward understanding task relationships or exploring alternative meta-learning approaches.

Abstract

Despite the widespread adoption of multi-task training in deep learning, little is understood about how multi-task learning (MTL) affects generalization. Prior work has conjectured that the negative effects of MTL are due to optimization challenges that arise during training, and many optimization methods have been proposed to improve multi-task performance. However, recent work has shown that these methods fail to consistently improve multi-task generalization. In this work, we seek to improve our understanding of these failures by empirically studying how MTL impacts the optimization of tasks, and whether this impact can explain the effects of MTL on generalization. We show that MTL results in a generalization gap (a gap in generalization at comparable training loss) between single-task and multi-task trajectories early into training. However, we find that factors of the optimization trajectory previously proposed to explain generalization gaps in single-task settings cannot explain the generalization gaps between single-task and multi-task models. Moreover, we show that the amount of gradient conflict between tasks is correlated with negative effects to task optimization, but is not predictive of generalization. Our work sheds light on the underlying causes for failures in MTL and, importantly, raises questions about the role of general purpose multi-task optimization algorithms.
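The abstract defines the generalization gap as a gap in generalization at comparable training loss. A minimal sketch of how such a comparison could be computed is given below. It is not the authors' code: the trajectory format (a list of `(train_loss, test_error)` pairs), the linear interpolation, the function names, and the sign convention are all assumptions made for illustration.

```python
from bisect import bisect_left


def error_at_loss(trajectory, target_loss):
    """Linearly interpolate test error at a given training loss.

    `trajectory` is a hypothetical format: a list of
    (train_loss, test_error) pairs logged during training.
    """
    # Sort by training loss so neighbouring points bracket the target.
    points = sorted(trajectory)
    losses = [p[0] for p in points]
    if not (losses[0] <= target_loss <= losses[-1]):
        raise ValueError("target loss is outside the range of this trajectory")
    i = bisect_left(losses, target_loss)
    if losses[i] == target_loss:
        return points[i][1]
    (l0, e0), (l1, e1) = points[i - 1], points[i]
    t = (target_loss - l0) / (l1 - l0)
    return e0 + t * (e1 - e0)


def generalization_gap(single_task, multi_task, target_loss):
    """Multi-task test error minus single-task test error at one training loss.

    Assumed sign convention: a negative value means the multi-task
    trajectory generalizes better at that loss (positive transfer).
    """
    return error_at_loss(multi_task, target_loss) - error_at_loss(
        single_task, target_loss
    )


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    single = [(1.0, 0.30), (0.5, 0.20), (0.1, 0.15)]
    multi = [(1.0, 0.28), (0.5, 0.16), (0.2, 0.12)]
    print(generalization_gap(single, multi, 0.75))
```

Comparing at a matched training loss, rather than at a matched step count, separates the question of how well a task was optimized from the question of how well it generalizes given that level of optimization.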

Paper Structure (29 sections, 12 equations, 10 figures, 1 table)

Introduction
Background and Preliminaries
Multi-Task Optimization and Transfer
Experimental Setup
What Does the Training Loss Trajectory Tell Us About Transfer?
Multi-Task Transfer Occurs Early Into Training
Can Factors of the Optimization Trajectory Explain Transfer?
Factors of the Optimization Trajectory are Not Correlated with Trade-Offs in Generalization
Can Factors of the Optimization Trajectory Explain the Impact of SMTOs?
Does Gradient Conflict Explain Impact to Optimization or Generalization?
Conflict Has a Predictable (Negative) Impact to Optimization Trajectories
Conflict Does Not Have a Predictable Effect on Generalization
Conclusion
Future Directions
Limitations
...and 14 more sections

Figures (10)

Figure 1: Fashion1 (\ref{sec:setup}) training loss by generalization for the single-task setting (blue curve) and two multi-task settings (red and green curves). The impact of multi-task training on test accuracy (positive and negative) is detectable early into the training trajectory, at comparatively high training losses.
Figure 2: Generalization ($\mathcal{E}_k$) versus Loss ($\mathcal{L}_k$) curves for tasks which exhibit positive or negative multi-task transfer in 4 multi-task settings (for more tasks, see \ref{app:generalization-curves-results}). In general, multi-task trajectories converge to a higher training loss than single-task trajectories. However, transfer (positive and negative) is exhibited as a generalization gap between single-task and multi-task trajectories at comparably high training losses, i.e. transfer can be observed early into training. In other words, multi-task transfer is a property of how multi-task training impacts the early phase of learning, rather than a property of how well the task training loss is minimized. This suggests that negative transfer must be explained by higher-order factors of the optimization trajectory than the training loss.
Figure 3: Factors of the optimization trajectory are unable to simultaneously explain negative and positive transfer. We plot the trajectories of factors of the loss surface (sharpness, gradient covariance, and Fisher information) for FashionMTL, corresponding to the generalization trajectories in \ref{fig:mnist-transfer-ex} (similar plots for the other multi-task settings are shown in \ref{app:generalization-curves-results}). We expect to see the red trajectory, which yields negative transfer, exhibit worse optimization properties than the single-task trajectory (blue curve) and vice-versa for the green curve (positive transfer). Regardless of whether multi-task training resulted in negative or positive transfer, multi-task trajectories (green and red curves) exhibit better optimization properties (e.g. lower sharpness or early-phase FIM "explosions") than single-task trajectories (blue curve).
Figure 4: The impact of SMTOs on generalization vs. their impact on optimization trajectories, as their %$\Delta$ over the UMTG trajectory. SMTOs aim to impact task generalization by affecting optimization, so we expect positive (negative) changes to task generalization to be corroborated by positive (negative) changes to at least one factor of optimization. In other words, for a factor to explain how an SMTO impacts generalization, all of an SMTO's points should lie within the shaded quadrants of a plot. However, there is no SMTO whose impacted tasks lie solely in the shaded regions, suggesting that the mechanisms by which SMTOs improve or harm task performance are not explained by task optimization trajectories.
Figure 5: The impact of gradient conflict on factors of the target-task optimization (a) and generalization (b) across auxiliary task settings. The Pearson-r correlation coefficient and p-value are shown at the top. Gradient similarity is negatively correlated with each of the optimization factors that we study, implying that high gradient conflict negatively impacts many factors of task optimization; however, gradient conflict is not negatively correlated with target-task generalization. In other words, while gradient conflict has a consistent, negative impact on optimization, this effect does not predict or explain transfer.
...and 5 more figures
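
The Figure 5 caption describes correlating gradient similarity between tasks with per-task outcomes using Pearson's r. The sketch below shows one way such an analysis could be set up. It assumes that gradient similarity is measured as the cosine similarity between flattened task gradients, a common choice that this summary does not confirm; the function names are hypothetical, and the p-value reported in the figure is not computed here.

```python
import math


def cosine_similarity(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors.

    Under the assumed convention, values below zero indicate
    conflicting gradients.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero gradient")
    return dot / (norm_a * norm_b)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up gradients for a target task and an auxiliary task.
    print(cosine_similarity([1.0, 2.0], [-1.0, 0.5]))
    # Made-up per-setting values: mean gradient similarity vs. some outcome.
    print(pearson_r([-0.2, 0.0, 0.1, 0.3], [0.9, 0.7, 0.6, 0.4]))
```

In this setup, each auxiliary-task setting would contribute one point: its average gradient similarity with the target task, paired with either an optimization factor or the target-task generalization.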
