Nonlinear Meta-Learning Can Guarantee Faster Rates

Dimitri Meunier; Zhu Li; Arthur Gretton; Samory Kpotufe

TL;DR

The present work derives theoretical guarantees for meta-learning with nonlinear representations, and shows that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions, yielding improved rates that scale with the number of tasks as desired.

Abstract

Many recent theoretical works on meta-learning aim to achieve guarantees in leveraging similar representational structures from related tasks towards simplifying a target task. The main aim of theoretical guarantees on the subject is to establish the extent to which convergence rates -- in learning a common representation -- may scale with the number $N$ of tasks (as well as the number of samples per task). First steps in this setting demonstrate this property when both the shared representation amongst tasks, and task-specific regression functions, are linear. This linear setting readily reveals the benefits of aggregating tasks, e.g., via averaging arguments. In practice, however, the representation is often highly nonlinear, introducing nontrivial biases in each task that cannot easily be averaged out as in the linear case. In the present work, we derive theoretical guarantees for meta-learning with nonlinear representations. In particular, assuming the shared nonlinearity maps to an infinite-dimensional reproducing kernel Hilbert space, we show that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions, yielding improved rates that scale with the number of tasks as desired.

Paper Structure (28 sections, 30 theorems, 245 equations, 3 figures, 1 table)

Introduction
Background & Notations
Nonlinear Meta-learning
Population Set-up
Learning Set-up
Instantiation in Data Space
Main Results
Regularity Assumptions
Main Theorems
Regimes of Gain.
Characterizing $\alpha$, $p$ and $r$.
Experimental Results
Parameter values: $p$, $\alpha$ and $r$.
Choice of regularization.
Learning at the parametric rate.
...and 13 more sections

Key Result

Proposition 1

Consider the generalized eigenvalue problem which consists of finding generalized eigenvectors $(\alpha^{\top}, \beta^{\top})^{\top} \in \mathbb{R}^{2N}$ and generalized eigenvalues $\gamma \in \mathbb{R}$ such that Define $A \doteq [\hat{f}_1', \ldots, \hat{f}_N']$ and $B \doteq [\hat{f}_1, \ldots, \hat{f}_N]$ and let $\{(\hat{\alpha}^{\top}_i, \hat{\beta}^{\top}_i)^{\top}\}_{i=1}^s$ be the gene

Figures (3)

Figure 1: (Left)-(Center) Orthonormal system in $\mathcal{H}$ spanning $\mathcal{H}_s$ for $s=3$ (Left) and $s=10$ (Center), respectively. (Right) Example of a sampled task for $s=10$ with $300$ datapoints; the blue solid line represents the ground truth.
Figure 2: (Left) Meta Learning versus Oracle: Comparison of the squared excess risk on the target task for the oracle estimator $\hat{f}_{\text{oracle}}$ (dotted red line) and the meta learning estimator $\hat{f}_{T,\lambda_*}$ trained with different numbers of tasks $N$ (solid lines). The $x$-axis represents the size of the dataset for the target task $(n_T)$. (Right) Effect of under-regularization: Comparison of the squared excess risk of the meta learning estimator trained with $\lambda = (nN)^{-\frac{2}{5}}$ (red dotted line) and $\lambda = n^{-\frac{2}{5}}$ (blue solid line). The $x$-axis represents the number of source tasks $(N)$. For both panels, $n=300$ and $s=25$, and results are averaged over $100$ generations of the source and target tasks.
Figure 3: (Left) Meta Learning versus Oracle: Comparison of the squared excess risk on the target task for the oracle estimator $\hat{f}_{\text{oracle}}$ (dotted red line) and the meta learning estimator $\hat{f}_{T,\lambda_*}$ trained with different numbers of tasks $N$ (solid lines). The $x$-axis represents the size of the dataset for the target task $(n_T)$. (Right) Effect of under-regularization: Comparison of the squared excess risk of the meta learning estimator trained with $\lambda = (nN)^{-\frac{2}{5}}$ (red dotted line) and $\lambda = n^{-\frac{2}{5}}$ (blue solid line). The $x$-axis represents the number of source tasks $(N)$. For both panels, $n=500$ and $s=50$, and results are averaged over $20$ generations of the source and target tasks.
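
The two regularization levels compared in the right-hand panels of Figures 2 and 3 differ only in whether the number of source tasks $N$ enters the rate. The following is a minimal sketch that tabulates both, assuming the exponent $2/5$ and the per-task sample size $n=300$ quoted in the captions. The function name `reg_schedules` and the labels `pooled` and `per_task` are hypothetical and introduced here for illustration only; they are not the paper's terminology.

```python
def reg_schedules(n, num_tasks, exponent=2 / 5):
    """Return the two regularization levels quoted in the figure captions.

    pooled   = (n * N) ** (-exponent)   # shrinks as the number of tasks N grows
    per_task = n ** (-exponent)         # independent of N

    The names `pooled` and `per_task` are illustrative labels (an assumption),
    not names used in the paper.
    """
    if n <= 0 or num_tasks <= 0:
        raise ValueError("n and num_tasks must be positive")
    pooled = (n * num_tasks) ** (-exponent)
    per_task = n ** (-exponent)
    return pooled, per_task


if __name__ == "__main__":
    n = 300  # samples per source task, as in the Figure 2 caption
    for num_tasks in (1, 10, 100, 1000):
        pooled, per_task = reg_schedules(n, num_tasks)
        print(f"N={num_tasks:5d}  (nN)^(-2/5)={pooled:.5f}  n^(-2/5)={per_task:.5f}")
```

With $N=1$ the two levels coincide. For $N>1$ the first is smaller by a factor of $N^{-2/5}$, so each source task is regularized less than a single-task choice based on $n$ alone would suggest.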

Theorems & Definitions (74)

Remark 1
Remark 2
Remark 3
Remark 4
Remark 5: Differences from Linear Case
Proposition 1
Proposition 2
Remark 6
Theorem 1
Proposition 3: Wedin’s $\sin\Theta$ Theorem
...and 64 more
