Nonlinear Meta-Learning Can Guarantee Faster Rates

Dimitri Meunier, Zhu Li, Arthur Gretton, Samory Kpotufe

TL;DR

The present work derives theoretical guarantees for meta-learning with nonlinear representations, and shows that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions, yielding improved rates that scale with the number of tasks as desired.

Abstract

Many recent theoretical works on \emph{meta-learning} aim to achieve guarantees in leveraging similar representational structures from related tasks towards simplifying a target task. The main aim of theoretical guarantees on the subject is to establish the extent to which convergence rates -- in learning a common representation -- \emph{may scale with the number $N$ of tasks} (as well as the number of samples per task). First steps in this setting demonstrate this property when both the shared representation amongst tasks, and task-specific regression functions, are linear. This linear setting readily reveals the benefits of aggregating tasks, e.g., via averaging arguments. In practice, however, the representation is often highly nonlinear, introducing nontrivial biases in each task that cannot easily be averaged out as in the linear case. In the present work, we derive theoretical guarantees for meta-learning with nonlinear representations. In particular, assuming the shared nonlinearity maps to an infinite dimensional reproducing kernel Hilbert space, we show that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions, yielding improved rates that scale with the number of tasks as desired.
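
The contrast between errors that average out and biases that do not can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions, not the paper's argument: the symbols $\theta$, $b_t$, $\varepsilon_t$, and $\sigma^2$ are hypothetical and introduced here for illustration. Suppose each of the $N$ tasks, with $n$ samples, yields an estimate of a shared quantity $\theta$ with a deterministic bias $b_t$ and independent zero-mean noise $\varepsilon_t$ of variance $\sigma^2/n$.

```latex
\hat{\theta}_t = \theta + b_t + \varepsilon_t,
\qquad
\bar{\theta} \doteq \frac{1}{N}\sum_{t=1}^{N} \hat{\theta}_t ,
\qquad
\mathbb{E}\big[(\bar{\theta} - \theta)^2\big]
  = \Big(\frac{1}{N}\sum_{t=1}^{N} b_t\Big)^{2} + \frac{\sigma^2}{nN}.
```

In this toy model the noise term decays with both $n$ and $N$. The bias term vanishes when $b_t = 0$ for every task, but if the biases share a common component (for instance $b_t = b$ for all $t$), the error cannot fall below $b^2$ however many tasks are aggregated. Under these assumptions, any gain from adding tasks then requires making the per-task bias itself small.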

Paper Structure (28 sections, 30 theorems, 245 equations, 3 figures, 1 table)


Key Result

Proposition 1

Consider the generalized eigenvalue problem which consists of finding generalized eigenvectors $(\alpha^{\top}, \beta^{\top})^{\top} \in \mathbb{R}^{2N}$ and generalized eigenvalues $\gamma \in \mathbb{R}$ such that […]. Define $A \doteq [\hat{f}_1', \ldots, \hat{f}_N']$ and $B \doteq [\hat{f}_1, \ldots, \hat{f}_N]$, and let $\{(\hat{\alpha}^{\top}_i, \hat{\beta}^{\top}_i)^{\top}\}_{i=1}^s$ be the gene…

Figures (3)

  • Figure 1: (Left)-(Center) Orthonormal system in $\mathcal{H}$ spanning $\mathcal{H}_s$ for $s=3$ (Left) and $s=10$ (Center), respectively. (Right) Example of a sampled task for $s=10$ with $300$ datapoints; the blue solid line represents the ground truth.
  • Figure 2: (Left) Meta-learning versus oracle: comparison of the squared excess risk on the target task for the oracle estimator $\hat{f}_{\text{oracle}}$ (dotted red line) and the meta-learning estimator $\hat{f}_{T,\lambda_*}$ trained with different numbers of tasks $N$ (solid lines). The $x$-axis represents the size of the dataset for the target task $(n_T)$. (Right) Effect of under-regularization: comparison of the squared excess risk of the meta-learning estimator trained with $\lambda = (nN)^{-\frac{2}{5}}$ (red dotted line) and $\lambda = n^{-\frac{2}{5}}$ (blue solid line). The $x$-axis represents the number of source tasks $(N)$. For both panels, $n=300$, $s=25$, and results are averaged over $100$ generations of the source and target tasks.
  • Figure 3: (Left) Meta-learning versus oracle: comparison of the squared excess risk on the target task for the oracle estimator $\hat{f}_{\text{oracle}}$ (dotted red line) and the meta-learning estimator $\hat{f}_{T,\lambda_*}$ trained with different numbers of tasks $N$ (solid lines). The $x$-axis represents the size of the dataset for the target task $(n_T)$. (Right) Effect of under-regularization: comparison of the squared excess risk of the meta-learning estimator trained with $\lambda = (nN)^{-\frac{2}{5}}$ (red dotted line) and $\lambda = n^{-\frac{2}{5}}$ (blue solid line). The $x$-axis represents the number of source tasks $(N)$. For both panels, $n=500$, $s=50$, and results are averaged over $20$ generations of the source and target tasks.
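
The two regularization levels compared in the (Right) panels differ only by a factor of $N^{-\frac{2}{5}}$. The short sketch below simply evaluates both formulas from the captions to make the gap concrete. The function names and the list of task counts are hypothetical choices for illustration, and reading a smaller $\lambda$ as "less regularization" assumes $\lambda$ acts as a penalty weight.

```python
# Evaluate the two regularization levels named in the captions of Figures 2 and 3.
# n = samples per source task, N = number of source tasks.
# Function names are illustrative only; they do not come from the paper.


def lambda_pooled(n: int, N: int) -> float:
    """lambda = (n * N)^(-2/5): decreases as more source tasks are added."""
    return (n * N) ** (-2 / 5)


def lambda_single_task(n: int) -> float:
    """lambda = n^(-2/5): depends only on the per-task sample size."""
    return n ** (-2 / 5)


if __name__ == "__main__":
    n = 300  # per-task sample size reported for Figure 2
    for N in (1, 10, 100):  # hypothetical task counts, chosen for illustration
        pooled = lambda_pooled(n, N)
        single = lambda_single_task(n)
        # The ratio pooled / single equals N^(-2/5) up to floating-point error.
        print(f"N={N:4d}  (nN)^(-2/5)={pooled:.5f}  n^(-2/5)={single:.5f}  "
              f"ratio={pooled / single:.5f}")
```

The two choices coincide at $N=1$, and for any $N>1$ the first is strictly smaller, so the gap between the two curves in the (Right) panels corresponds to a regularization level that keeps shrinking as tasks are added versus one that stays fixed.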

Theorems & Definitions (74)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5: Differences from Linear Case
  • Proposition 1
  • Proposition 2
  • Remark 6
  • Theorem 1
  • Proposition 3: Wedin’s $\sin\Theta$ Theorem
  • ...and 64 more