Asymptotic Behavior of Multi-Task Learning: Implicit Regularization and Double Descent Effects

Ayed M. Alrashdi, Oussama Dhifallah, Houssem Sifaou

TL;DR

It is shown that combining multiple tasks is asymptotically equivalent to a traditional formulation with additional regularization terms that help improve the generalization performance.

Abstract

Multi-task learning seeks to improve the generalization error by leveraging the common information shared by multiple related tasks. One challenge in multi-task learning is identifying formulations capable of uncovering the common information shared between different but related tasks. This paper provides a precise asymptotic analysis of a popular multi-task formulation associated with misspecified perceptron learning models. The main contribution of this paper is to precisely determine the reasons behind the benefits gained from combining multiple related tasks. Specifically, we show that combining multiple tasks is asymptotically equivalent to a traditional formulation with additional regularization terms that help improve the generalization performance. Another contribution is to empirically study the impact of combining tasks on the generalization error. In particular, we empirically show that the combination of multiple tasks postpones the double descent phenomenon and can mitigate it asymptotically.

Paper Structure

This paper contains 26 sections, 7 theorems, 68 equations, 7 figures.

Key Result

Theorem 1

Let Assumptions rad_fv--highdim hold. In addition, assume that all tasks have the same training set size, i.e., $\alpha_t = \alpha$ for all $t \in \lbrace 1, \dots, T \rbrace$. Under these conditions, the generalization error defined in gener_trg associated with the $t^{\text{th}}$ task converges in In the above, $G_1$ and $G_2$ are independent standard Gaussian random variables. Also, $c_0$, $c_{

Figures (7)

  • Figure 1: Solid lines: Theoretical predictions. Circles: Numerical simulations for the multi-task formulation. (a) A squared loss and a linear regression model. (b) A logistic loss and a binary classification model. The results show a double descent pattern in the generalization error: the sweet spot is zero for the regression model and strictly positive for the classification model. Note that the position of the interpolation threshold varies based on how many tasks are included. It is also evident that increasing the number of tasks contributes to improved generalization performance.
  • Figure 2: Continuous lines: Theoretical predictions. Circles: Numerical simulations for the multi-task formulation. (a) We consider the linear regression model and the squared loss. We set $p=2000$, $\alpha=5$, $\rho=0.8$, $\gamma_1=10^{-2}$ and $T=3$. (b) We consider the binary classification model and the logistic loss. We set $p=600$, $\alpha=1$, $\rho=0.8$, $\gamma_1=10^{-4}$ and $T=2$. The results are averaged over $25$ independent Monte Carlo trials.
  • Figure 3: Continuous lines: Theoretical predictions. Circles: Numerical simulations for the multi-task formulation. (a) We consider the linear regression model and the squared loss. We set $p=1000$, $\alpha=2$, $\kappa=0.5$, $\gamma_1=0.1$, $\gamma_2=0.5$ and $\rho=0.85$. (b) We consider the binary classification model and the squared loss. We set $p=1000$, $\alpha=2$, $\kappa=1$, $\gamma_1=0.05$, $\gamma_2=0.2$ and $\rho=0.75$. The results are averaged over $100$ independent Monte Carlo trials.
  • Figure 4: (a) Continuous lines: Theoretical predictions. Circles: Numerical simulations for the multi-task and separate formulations. We consider the binary classification model and the squared loss. We set $\alpha=4$, $\kappa=2$, $\gamma_1=0.1$, $\gamma_2=1$ and $\rho=0.3$. (b) The value of $R(\rho)$ as a function of the similarity measure $\rho$. We consider the binary classification model and the squared loss. We used $p=1000$, $\alpha=2$, $\kappa=1$, $\gamma_1=0.01$, $\gamma_2=0.6$ and $\rho=0.75$. The results are averaged over $100$ independent trials.
  • Figure 5: Solid lines: Theoretical predictions in Theorem 2. Circles: Numerical simulations for the multi-task formulation. We consider two tasks in the multi-task formulation. The parameters are set as follows: $p=2000$, $\alpha=4$, $\rho=0.7$, $T=2$, $\gamma_1=0.005$ and $\gamma_2=1$. Moreover, we take $\alpha_1=\alpha$ and $\alpha_2=\alpha/2$. (a) The performance of the first task. (b) The performance of the second task. The results are averaged over $100$ independent Monte Carlo trials.
  • ...and 2 more figures
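
The regression experiments described in the captions above can be explored with a small simulation. The sketch below is illustrative only and is not the paper's formulation, which this excerpt does not state. It assumes a common shared-plus-specific decomposition $w_t = w_0 + v_t$ with a squared loss, one ridge penalty on the task-specific parts and one on the shared part. The names `make_tasks`, `fit_multitask`, `fit_separate`, `gen_error`, `gamma_task` and `gamma_shared` are hypothetical, and their mapping to the paper's $\gamma_1$ and $\gamma_2$ is not established here.

```python
import numpy as np


def make_tasks(p, n, T, rho, noise_std, rng):
    """Draw T linear-regression tasks whose unit-norm true weight vectors
    have pairwise correlation of roughly rho (an assumed similarity model)."""
    common = rng.standard_normal(p)
    Xs, ys, xis = [], [], []
    for _ in range(T):
        private = rng.standard_normal(p)
        xi = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * private
        xi /= np.linalg.norm(xi)
        X = rng.standard_normal((n, p))
        Xs.append(X)
        ys.append(X @ xi + noise_std * rng.standard_normal(n))
        xis.append(xi)
    return Xs, ys, xis


def fit_separate(X, y, gamma):
    """Ridge regression on a single task (the separate baseline)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + gamma * np.eye(p), X.T @ y / n)


def fit_multitask(Xs, ys, gamma_task, gamma_shared):
    """Jointly fit w_t = w0 + v_t by minimizing
    sum_t ||y_t - X_t (w0 + v_t)||^2 / (2n)
      + (gamma_task / 2) sum_t ||v_t||^2 + (gamma_shared / 2) ||w0||^2.
    Assumes every task has the same number of samples n."""
    T = len(Xs)
    n, p = Xs[0].shape
    # Stack the unknowns as z = [w0, v_1, ..., v_T] and build one design matrix.
    A = np.zeros((T * n, (T + 1) * p))
    for t, X in enumerate(Xs):
        rows = slice(t * n, (t + 1) * n)
        A[rows, :p] = X                          # shared component
        A[rows, (t + 1) * p:(t + 2) * p] = X     # task-specific component
    b = np.concatenate(ys)
    reg = np.concatenate([np.full(p, gamma_shared), np.full(T * p, gamma_task)])
    z = np.linalg.solve(A.T @ A / n + np.diag(reg), A.T @ b / n)
    return z[:p], z[p:].reshape(T, p)


def gen_error(w, xi, noise_std):
    """Expected squared test error for x ~ N(0, I): ||w - xi||^2 + noise variance."""
    return float(np.sum((w - xi) ** 2) + noise_std ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xs, ys, xis = make_tasks(p=40, n=60, T=3, rho=0.8, noise_std=0.5, rng=rng)
    w0, V = fit_multitask(Xs, ys, gamma_task=0.1, gamma_shared=0.5)
    for t in range(3):
        joint = gen_error(w0 + V[t], xis[t], 0.5)
        alone = gen_error(fit_separate(Xs[t], ys[t], 0.1), xis[t], 0.5)
        print(f"task {t}: multi-task {joint:.3f}, separate {alone:.3f}")
```

In this sketch, letting `gamma_shared` grow large drives `w0` toward zero, so the joint fit reduces to independent ridge regressions with penalty `gamma_task`. Sweeping `n` relative to `p` for a small `gamma_task` is one way to look for a double descent pattern under these assumptions.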

Theorems & Definitions (11)

  • Theorem 1: Symmetric Multi-Task Analysis
  • proof
  • Lemma 1: Large Number of Tasks
  • proof
  • Lemma 2: Separate Formulation
  • proof
  • Corollary 1: Regularization Effects
  • Theorem 2: General Multi-Task Analysis
  • proof
  • Theorem 3: MCGMT dhi21inherent
  • ...and 1 more