First-order ANIL provably learns representations despite overparametrization

Oğuz Kaan Yüksel; Etienne Boursier; Nicolas Flammarion

TL;DR

This work analyzes first-order ANIL (FO-ANIL) with a linear two-layer network, under a linear shared-representation model and in the limit of infinitely many tasks. It proves that FO-ANIL learns the low-dimensional shared subspace spanned by $B_\star$ even when the network is overparametrized ($k' \ge k$), while unlearning the orthogonal complement, and shows that a single gradient step on a new task yields strong adaptation due to an $\alpha^{-1}$ scaling. The results establish a theoretical guarantee for representation learning in model-agnostic meta-learning under misspecification, contrasting with multi-task approaches that can falter when $k'<k$ or $k'>d$. Empirically, toy experiments show that FO-ANIL learns the ground-truth subspace, slowly unlearns the orthogonal directions, and achieves near-oracle performance at test time, supporting the practical relevance of the theory. Overall, the findings connect pretraining for meta-learning with representation-learning guarantees and highlight the robustness of agnostic methods to architectural misspecification, while outlining limitations and avenues for extending the analysis to finite tasks and nonlinear networks.

Abstract

Due to its empirical success in few-shot classification and reinforcement learning, meta-learning has recently received significant interest. Meta-learning methods leverage data from previous tasks to learn a new task in a sample-efficient manner. In particular, model-agnostic methods look for initialization points from which gradient descent quickly adapts to any new task. Although it has been empirically suggested that such methods perform well by learning shared representations during pretraining, there is limited theoretical evidence of such behavior. More importantly, it has not been shown that these methods still learn a shared structure, despite architectural misspecifications. In this direction, this work shows, in the limit of an infinite number of tasks, that first-order ANIL with a linear two-layer network architecture successfully learns linear shared representations. This result even holds with overparametrization; having a width larger than the dimension of the shared representations results in an asymptotically low-rank solution. The learned solution then yields a good adaptation performance on any new task after a single gradient step. Overall, this illustrates how well model-agnostic methods such as first-order ANIL can learn shared representations.
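To make the procedure described in the abstract concrete, the following is a minimal NumPy sketch of one FO-ANIL iteration on a linear two-layer student $x \mapsto w^\top B^\top x$. It is an illustration under stated assumptions, not the authors' implementation: the function names, the squared loss normalized by $1/(2m)$, the Gaussian inputs and task heads, the split into `m_in` inner and `m_out` outer samples, and the step sizes `alpha` (inner) and `beta` (outer) are all choices made here. The theory is stated for infinitely many tasks, whereas the sketch averages over a finite batch of `n_tasks` sampled tasks.

```python
import numpy as np

def sample_task(rng, B_star, m, noise_std):
    """Draw one regression task y = <B_star w_star, x> + noise (assumed Gaussian model)."""
    d, k = B_star.shape
    w_star = rng.standard_normal(k)      # task-specific head of the teacher
    X = rng.standard_normal((m, d))      # isotropic Gaussian inputs (assumption)
    y = X @ B_star @ w_star + noise_std * rng.standard_normal(m)
    return X, y

def loss_grads(w, B, X, y):
    """Gradients of (1/2m) * ||X B w - y||^2 with respect to w and B."""
    m = X.shape[0]
    g = X.T @ (X @ B @ w - y) / m        # gradient with respect to theta = B w
    return B.T @ g, np.outer(g, w)

def fo_anil_step(rng, w, B, B_star, n_tasks, m_in, m_out, alpha, beta, noise_std):
    """One first-order ANIL update, averaged over n_tasks freshly sampled tasks."""
    gw_sum = np.zeros_like(w)
    gB_sum = np.zeros_like(B)
    for _ in range(n_tasks):
        X, y = sample_task(rng, B_star, m_in + m_out, noise_std)
        X_in, y_in = X[:m_in], y[:m_in]
        X_out, y_out = X[m_in:], y[m_in:]
        # Inner loop: ANIL adapts only the head w with one gradient step.
        gw_in, _ = loss_grads(w, B, X_in, y_in)
        w_hat = w - alpha * gw_in
        # Outer loop, first-order: evaluate gradients at (w_hat, B) on the
        # held-out samples and ignore the dependence of w_hat on (w, B).
        gw, gB = loss_grads(w_hat, B, X_out, y_out)
        gw_sum += gw
        gB_sum += gB
    return w - beta * gw_sum / n_tasks, B - beta * gB_sum / n_tasks
```

Calling `fo_anil_step` repeatedly from a small random initialization with hidden width $k' \ge k$ gives a finite-task approximation of the dynamics studied in the paper. At test time, adaptation to a new task corresponds to the single inner step `w_hat = w - alpha * gw_in` computed on that task's samples.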

Paper Structure (43 sections, 208 equations, 17 figures, 1 table)

Introduction
Contributions.
Problem setting
Data distribution
FO-ANIL algorithm
Detailed iterations
Learning a good representation
Fast adaptation to a new task
Discussion
No prior structure knowledge.
Superiority of agnostic methods.
Infinite tasks model.
Limitations.
Additional technical discussion.
Experiments
...and 28 more sections

Figures (17)

Figure 1: Left: Regression tasks with parameters $\theta_{\star, i} \in \mathbb{R}^{d}$ are confined to a lower-dimensional subspace, the column space of the matrix $B_\star \in \mathbb{R}^{d \times k}$. Center: This is equivalent to having two-layer, linear teacher networks where $B_\star$ is the shared hidden layer and the output layers $w_{\star, i} \in \mathbb{R}^k$ are task-specific. The meta-learning task is to find an initialization that allows fast adaptation to any such task with a few samples. Right: The student network has the same architecture but is agnostic to the hidden dimension of the problem, $k'\geq k$, which is the main difficulty in our theoretical setting. In contrast, previous works on model-agnostic and representation learning methods assume $k'=k$, i.e., that the hidden dimension is known a priori to the learner.
Figure 2: Evolution of smallest (left) and largest (right) squared singular value of $B_\star^\top B_t$ during training. The shaded area represents the standard deviation observed over $10$ runs.
Figure 3: Evolution of average (left) and largest (right) squared singular value of $B_{\star,\perp}^\top B_t$ during training. The shaded area represents the standard deviation observed over $10$ runs.
Figure 4: Evolution of largest (left) and smallest (right) squared singular values of $B_\star^\top B_t$ during training. The shaded area represents the standard deviation observed over $3$ runs.
Figure 5: Evolution of average (left) and largest (right) squared singular value of $B_{\star,\perp}^\top B_t$ during training. The shaded area represents the standard deviation observed over $3$ runs.
...and 12 more figures
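
The quantities tracked in Figures 2 to 5 are squared singular values of $B_\star^\top B_t$ (the learned component inside the ground-truth subspace) and of $B_{\star,\perp}^\top B_t$ (the component in its orthogonal complement). A minimal NumPy sketch of how they could be computed is given below; the function names and the returned dictionary keys are hypothetical, and $B_\star$ is assumed to have full column rank so that a complete QR factorization yields a basis of the orthogonal complement.

```python
import numpy as np

def orthogonal_complement(B_star):
    """Orthonormal basis of the orthogonal complement of col(B_star).

    Assumes B_star (d x k) has full column rank.
    """
    d, k = B_star.shape
    Q, _ = np.linalg.qr(B_star, mode="complete")  # Q is d x d and orthogonal
    return Q[:, k:]                               # last d - k columns span the complement

def subspace_diagnostics(B_star, B):
    """Squared singular values of B_star^T B and B_perp^T B, summarized as in the captions."""
    B_perp = orthogonal_complement(B_star)
    s_par = np.linalg.svd(B_star.T @ B, compute_uv=False) ** 2
    s_perp = np.linalg.svd(B_perp.T @ B, compute_uv=False) ** 2
    return {
        "min_parallel": s_par.min(),   # smallest squared singular value of B_star^T B
        "max_parallel": s_par.max(),   # largest squared singular value of B_star^T B
        "mean_perp": s_perp.mean(),    # average squared singular value of B_perp^T B
        "max_perp": s_perp.max(),      # largest squared singular value of B_perp^T B
    }
```

Under this reading, learning the shared representation corresponds to `min_parallel` staying bounded away from zero, while unlearning the complement corresponds to `max_perp` decaying during training.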

Theorems & Definitions (25)

