The Effects of Multi-Task Learning on ReLU Neural Network Functions

Julia Nakhleh, Joseph Shenouda, Robert D. Nowak

TL;DR

The paper investigates weight-decay trained, shallow, multi-output ReLU networks for multi-task interpolation and uncovers a sharp dichotomy between single-task and multi-task solutions. In the univariate setting, it proves that multi-task interpolation is almost surely unique and coincides with the connect-the-dots linear spline, which is the minimum-norm interpolant in the Sobolev RKHS $H^1([x_1,x_N])$, while single-task solutions generally reside in the non-Hilbert $\text{BV}^2$ space. For many tasks in the multivariate setting, the authors show the learned solution is well-approximated by RKHS ridge regression over a fixed kernel determined by the optimal neurons, with per-task penalties converging to a common scale as $T$ grows; this reveals a fundamental RKHS kernel interpretation of multi-task learning with ReLU networks and contrasts with the $\text{L}^1$-like behavior seen in single-task cases. Together, these results establish a concrete bridge between shallow ReLU networks under weight decay and kernel methods, offering insights into generalization, robustness, and the potential for kernel-based analyses in multi-task neural settings.

Abstract

This paper studies the properties of solutions to multi-task shallow ReLU neural network learning problems, wherein the network is trained to fit a dataset with minimal sum of squared weights. Remarkably, the solutions learned for each individual task resemble those obtained by solving a kernel regression problem, revealing a novel connection between neural networks and kernel methods. It is known that single-task neural network learning problems are equivalent to a minimum norm interpolation problem in a non-Hilbertian Banach space, and that the solutions of such problems are generally non-unique. In contrast, we prove that the solutions to univariate-input, multi-task neural network interpolation problems are almost always unique, and coincide with the solution to a minimum-norm interpolation problem in a Sobolev (Reproducing Kernel) Hilbert Space. We also demonstrate a similar phenomenon in the multivariate-input case; specifically, we show that neural network learning problems with large numbers of tasks are approximately equivalent to an $\ell^2$ (Hilbert space) minimization problem over a fixed kernel determined by the optimal neurons.
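The univariate claim can be illustrated numerically. The sketch below is not the authors' code: it builds the connect-the-dots interpolant for a small, randomly generated multi-task dataset and compares its energy $\sum_t \int |f_t'|^2$ with that of another interpolant of the same data. It assumes that this energy is the quantity governing the Sobolev $H^1$ norm mentioned above; the helper names (`connect_the_dots`, `spline_energy`, `numeric_energy`, `bump`) and the data are hypothetical.

```python
import numpy as np

def connect_the_dots(x, Y):
    """Return f mapping query points to (n_queries, T) outputs by linear interpolation.

    x: (N,) strictly increasing inputs; Y: (N, T) labels, one column per task.
    """
    x = np.asarray(x, dtype=float)
    Y = np.asarray(Y, dtype=float)

    def f(q):
        q = np.atleast_1d(np.asarray(q, dtype=float))
        return np.stack([np.interp(q, x, Y[:, t]) for t in range(Y.shape[1])], axis=1)

    return f

def spline_energy(x, Y):
    """Exact sum over tasks of the integral of |f_t'|^2 for the connect-the-dots interpolant."""
    dx = np.diff(x)
    dY = np.diff(Y, axis=0)
    return float(np.sum(dY ** 2 / dx[:, None]))

def numeric_energy(g, a, b, n=20001):
    """Finite-difference estimate of the same energy for an arbitrary function g on [a, b]."""
    q = np.linspace(a, b, n)
    step = np.diff(q)[:, None]
    slopes = np.diff(g(q), axis=0) / step
    return float(np.sum(slopes ** 2 * step))

# Hypothetical multi-task dataset: N = 5 inputs, T = 3 tasks with Gaussian labels.
rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.5, 3.0, 4.0])
Y = rng.standard_normal((5, 3))
f = connect_the_dots(x, Y)

def bump(q):
    """A smooth perturbation that vanishes at every data point (added to all tasks)."""
    q = np.atleast_1d(np.asarray(q, dtype=float))
    return 0.3 * np.prod([q - xi for xi in x], axis=0)[:, None]

def g(q):
    """A different interpolant of the same data."""
    return f(q) + bump(q)

E_spline = spline_energy(x, Y)
E_other = numeric_energy(g, x[0], x[-1])
print("connect-the-dots energy:", E_spline)
print("perturbed interpolant energy:", E_other)
```

Any perturbation that vanishes at the data points is orthogonal, in this energy inner product, to a piecewise-linear function with knots at the data, so the perturbed interpolant always has strictly larger energy.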

Paper Structure

This paper contains 23 sections, 10 theorems, 61 equations, 10 figures.

Key Result

Theorem 3.1

The connect-the-dots function $f_{{\mathcal{D}}}$ is always a solution to opt:pn. Moreover, the solution to problem opt:pn is non-unique if and only if the following condition is satisfied: for some $i = 2, \dots, N-2$, the two vectors and are both nonzero and aligned. (Two vectors ${\bm{u}}_1$ and ${\bm{u}}_2$ are aligned if ${\bm{u}}_1^{\top}{\bm{u}}_2 = \|{\bm{u}}_1\|\|{\bm{u}}_2\|$.) If this co
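The alignment condition in the theorem is easy to test numerically. The following is a minimal sketch, not taken from the paper: the function names and tolerances are assumptions, and the vectors passed in are arbitrary examples rather than the specific vectors the theorem refers to.

```python
import numpy as np

def are_aligned(u1, u2, rtol=1e-9, atol=1e-12):
    """True if u1^T u2 equals ||u1|| * ||u2|| up to floating-point tolerance."""
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    lhs = float(u1 @ u2)
    rhs = float(np.linalg.norm(u1) * np.linalg.norm(u2))
    return bool(np.isclose(lhs, rhs, rtol=rtol, atol=atol))

def nonzero_and_aligned(u1, u2):
    """The theorem's condition also requires both vectors to be nonzero."""
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    return bool(np.any(u1 != 0) and np.any(u2 != 0) and are_aligned(u1, u2))

print(nonzero_and_aligned([1.0, 2.0], [2.0, 4.0]))    # positive multiples
print(nonzero_and_aligned([1.0, 2.0], [-1.0, -2.0]))  # opposite directions
print(nonzero_and_aligned([1.0, 0.0], [0.0, 1.0]))    # orthogonal
```

By the Cauchy-Schwarz inequality, equality holds exactly when one vector is a nonnegative multiple of the other, which is why a zero vector satisfies the equality trivially and the theorem asks separately for both vectors to be nonzero.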

Figures (10)

  • Figure 1: Two solutions to ReLU neural network interpolation (blue) of training data (red). The functions on the left and right both interpolate the data, both are global minimizers of \ref{opt:wd} and \ref{opt:pn}, and both minimize the second-order total variation of the interpolating function \cite{parhi2021banach}. In fact, all convex combinations of the two solutions above are also solutions to this learning problem.
  • Figure 2: The connect-the-dots interpolant $f_{{\mathcal{D}}} = (f_{{\mathcal{D}}_1}, f_{{\mathcal{D}}_2}, f_{{\mathcal{D}}_3})$ of three datasets ${\mathcal{D}}_1, {\mathcal{D}}_2, {\mathcal{D}}_3$.
  • Figure 3: The function output $f_t$ around the knot at $\tilde{x}$, where $\tau = \frac{\tilde{x}-\tilde{x}_1}{\tilde{x}_2-\tilde{x}_1}$. Each line segment in the figure is labeled with its slope. For any particular output $t$, it may be the case that $f_t$ does not have a knot at $\tilde{x}$ (in which case $\delta_t = 0$); does not have a knot at $\tilde{x}_1$ (in which case $a_t = b_t + \delta_t$); and/or does not have a knot at $\tilde{x}_2$ (in which case $b_t - \frac{\tau}{1-\tau} \delta_t = c_t$).
  • Figure 4: Top Row: Three randomly initialized neural networks trained to interpolate the five red points with minimum sum of squared weights. Bottom Row: Three randomly initialized two-output neural networks trained to interpolate a multi-task dataset with minimum sum of squared weights. The labels for the first task are the five red points shown while the labels for the second were randomly sampled from a standard Gaussian distribution. Code: https://github.com/joeshenouda/effects-mtl-nns
  • Figure 5: ReLU network interpolation in two dimensions. The solutions shown were obtained with regularization parameter $\lambda \approx 0$. Top Row -- Solutions to single-task training: \ref{fig:sing_sol_1}, \ref{fig:sing_sol_2}, and \ref{fig:sing_sol_3} show solutions to ReLU neural network interpolation (blue surface) of training data (red). The eight data points are located at the vertices of two squares, both centered at the origin. The outer square has side-length two and values of $0$ at the vertices. The inner square has side-length one and values of $1$ at the vertices. All three functions interpolate the data and are global minimizers of \ref{opt:wd} and \ref{opt:pn} when solving for just this task (i.e., $T=1$). Due to the simplicity of this dataset, the optimality of the solutions in the first row was confirmed by solving the convex optimization problem equivalent to \ref{opt:wd} developed in \cite{ergen2021convex}. Bottom Row -- Solutions to multi-task training: \ref{fig:mtl_sol} shows the first output of a multi-task neural network with $T=101$ tasks. The first output is the original task depicted in the first row, while the labels for the other $100$ tasks are generated i.i.d. from a Bernoulli distribution with equal probability for one and zero. Here we show one representative example; more examples are depicted in \ref{appendix:additional_experiments}, showing that this phenomenon holds across many runs. \ref{fig:rkhs_sol} shows the solution to fitting the training data by solving \ref{opt:RKHS_problem} over a fixed set of features learned by the multi-task neural network with $T=100$ random tasks. We observe that, unlike the highly variable solutions of the single-task optimization problem, the solutions obtained by solving the multi-task optimization problems are nearly identical, as one would expect for kernel methods. Moreover, the solution obtained by solving \ref{opt:RKHS_problem} is also similar to the solution of the full multi-task training problem with all $T=101$ tasks. Code: https://github.com/joeshenouda/effects-mtl-nns
  • ...and 5 more figures
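The fixed-feature fit described in the Figure 5 caption can be sketched as a minimum $\ell^2$-norm interpolation over a frozen bank of ReLU features. The code below uses the eight-point dataset described in the caption, but everything else is an assumption made for illustration: the neurons are random stand-ins rather than features learned by a multi-task network, the exact form of the paper's RKHS problem is not reproduced here, and the names (`W`, `b`, `features`, `predict`) are hypothetical.

```python
import numpy as np

# Dataset from the Figure 5 caption: vertices of two squares centered at the origin.
outer = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)  # side length 2, label 0
inner = 0.5 * outer                                                  # side length 1, label 1
X = np.vstack([outer, inner])
y = np.concatenate([np.zeros(4), np.ones(4)])

# Hypothetical fixed neurons (random here; in the paper they come from multi-task training).
rng = np.random.default_rng(0)
K = 256
W = rng.standard_normal((K, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm input weights
b = rng.uniform(-1.5, 1.5, size=K)

def features(X):
    """ReLU feature map with the frozen neurons: shape (n_points, K)."""
    return np.maximum(X @ W.T + b, 0.0)

# For an underdetermined system, lstsq returns the least-squares solution of
# minimum l2 norm, i.e. the minimum-norm interpolant when the data can be fit exactly.
Phi = features(X)
v, *_ = np.linalg.lstsq(Phi, y, rcond=None)

def predict(X_new):
    """Evaluate the fitted function at new inputs."""
    return features(np.atleast_2d(X_new)) @ v

print("max interpolation error:", np.max(np.abs(predict(X) - y)))
print("output-weight norm:", np.linalg.norm(v))
```

Because the features are fixed, this is a linear problem with a unique minimum-norm solution, which is the sense in which the fit behaves like a kernel method rather than like single-task network training.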

Theorems & Definitions (19)

  • Theorem 3.1
  • Corollary 1
  • Remark 1
  • Lemma 3.2
  • proof: Proof of \ref{lemma:keylemma}
  • Theorem 4.1
  • Lemma 4.2
  • proof
  • proof
  • Proposition 1
  • ...and 9 more