
First-order ANIL provably learns representations despite overparametrization

Oğuz Kaan Yüksel, Etienne Boursier, Nicolas Flammarion

TL;DR

This work analyzes first-order ANIL (FO-ANIL) with a linear two-layer network, under a linear shared-representation model and in the limit of infinitely many tasks. It proves that FO-ANIL learns the low-dimensional shared subspace spanned by $B_\star$ even when the network is overparametrized ($k' \ge k$), while unlearning the orthogonal complement, and shows that a single gradient step on a new task yields strong adaptation due to an $\alpha^{-1}$ scaling. The results establish a theoretical guarantee for representation learning in model-agnostic meta-learning under misspecification, in contrast with multi-task approaches that can falter when $k'<k$ or $k'>d$. Empirically, the paper shows through toy experiments that FO-ANIL learns the ground-truth subspace, slowly unlearns the orthogonal directions, and achieves near-oracle performance at test time, supporting the practical relevance of the theory. Overall, the findings connect meta-learning pretraining with representation-learning guarantees and highlight the robustness of model-agnostic methods to architectural misspecification, while outlining limitations and avenues for extension to finite tasks and nonlinear networks.

Abstract

Due to its empirical success in few-shot classification and reinforcement learning, meta-learning has recently received significant interest. Meta-learning methods leverage data from previous tasks to learn a new task in a sample-efficient manner. In particular, model-agnostic methods look for initialization points from which gradient descent quickly adapts to any new task. Although it has been empirically suggested that such methods perform well by learning shared representations during pretraining, there is limited theoretical evidence of such behavior. More importantly, it has not been shown that these methods still learn a shared structure, despite architectural misspecifications. In this direction, this work shows, in the limit of an infinite number of tasks, that first-order ANIL with a linear two-layer network architecture successfully learns linear shared representations. This result even holds with overparametrization; having a width larger than the dimension of the shared representations results in an asymptotically low-rank solution. The learned solution then yields a good adaptation performance on any new task after a single gradient step. Overall, this illustrates how well model-agnostic methods such as first-order ANIL can learn shared representations.
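
To make the setting in the abstract concrete, here is a minimal NumPy sketch of one first-order ANIL update for a linear two-layer network $x \mapsto x^\top B w$. It is an illustration under stated assumptions, not the authors' implementation: the squared loss, the names `sq_loss_grads` and `fo_anil_step`, the step sizes `alpha` (inner) and `beta` (outer), and the split of each task into inner and outer samples are all choices made here for the example. The inner loop adapts only the head `w`; the outer update then uses gradients evaluated at the adapted head and ignores how the adapted head depends on the initialization, which is the first-order approximation.

```python
import numpy as np

def sq_loss_grads(B, w, X, y):
    """Gradients of 0.5 * mean((X @ B @ w - y)**2) with respect to B and w."""
    residual = X @ B @ w - y          # shape (n,)
    g = X.T @ residual / len(y)       # gradient w.r.t. the product B @ w, shape (d,)
    grad_B = np.outer(g, w)           # shape (d, k')
    grad_w = B.T @ g                  # shape (k',)
    return grad_B, grad_w

def fo_anil_step(B, w, tasks, alpha, beta):
    """One first-order ANIL update averaged over a batch of tasks.

    tasks: list of (X_in, y_in, X_out, y_out) tuples (hypothetical layout).
    alpha: inner-loop step size; beta: outer-loop step size (assumed names).
    """
    grad_B_sum = np.zeros_like(B)
    grad_w_sum = np.zeros_like(w)
    for X_in, y_in, X_out, y_out in tasks:
        # Inner loop: one gradient step on the head only, with B frozen.
        _, grad_w_in = sq_loss_grads(B, w, X_in, y_in)
        w_task = w - alpha * grad_w_in
        # Outer gradient at the adapted head, treating w_task as a constant
        # (no differentiation through the inner step).
        grad_B_out, grad_w_out = sq_loss_grads(B, w_task, X_out, y_out)
        grad_B_sum += grad_B_out
        grad_w_sum += grad_w_out
    n_tasks = len(tasks)
    return B - beta * grad_B_sum / n_tasks, w - beta * grad_w_sum / n_tasks
```

In this sketch, `B` has shape `(d, k')` with `k'` at least the true representation dimension `k`, so the overparametrized case discussed in the abstract corresponds simply to passing a wider `B` and `w`.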

Paper Structure

This paper contains 43 sections, 208 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: Left: Regression tasks with parameters $\theta_{\star, i} \in \mathbb{R}^{d}$ are confined to a lower-dimensional subspace, namely the column space of the matrix $B_\star \in \mathbb{R}^{d \times k}$. Center: This is equivalent to having two-layer linear teacher networks where $B_\star$ is the shared hidden layer and the output layers $w_{\star, i} \in \mathbb{R}^k$ are task-specific. The meta-learning task is to find an initialization that allows fast adaptation to any such task from a few samples. Right: The student network has the same architecture but is agnostic to the problem's hidden dimension, with $k'\geq k$; this is the main difficulty in our theoretical setting. In contrast, previous works on model-agnostic and representation learning methods assume $k'=k$, i.e., that the hidden dimension is known a priori to the learner.
  • Figure 2: Evolution of smallest (left) and largest (right) squared singular value of $B_\star^\top B_t$ during training. The shaded area represents the standard deviation observed over $10$ runs.
  • Figure 3: Evolution of average (left) and largest (right) squared singular value of $B_{\star,\perp}^\top B_t$ during training. The shaded area represents the standard deviation observed over $10$ runs.
  • Figure 4: Evolution of largest (left) and smallest (right) squared singular values of $B_\star^\top B_t$ during training. The shaded area represents the standard deviation observed over $3$ runs.
  • Figure 5: Evolution of average (left) and largest (right) squared singular value of $B_{\star,\perp}^\top B_t$ during training. The shaded area represents the standard deviation observed over $3$ runs.
  • ...and 12 more figures
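
The quantities tracked in Figures 2 to 5, the squared singular values of $B_\star^\top B_t$ and $B_{\star,\perp}^\top B_t$, can be computed as in the following sketch. It assumes that `B_star` has orthonormal columns and full column rank, and it builds a basis of the orthogonal complement from a complete QR decomposition; the function name `subspace_alignment` is hypothetical and the paper's own evaluation code may differ.

```python
import numpy as np

def subspace_alignment(B_star, B_t):
    """Squared singular values of B_star^T B_t and B_star_perp^T B_t.

    Assumes B_star (d x k) has orthonormal columns; B_t is d x k'.
    """
    k = B_star.shape[1]
    # Complete QR gives a d x d orthogonal matrix whose last d - k columns
    # span the orthogonal complement of the column space of B_star.
    Q, _ = np.linalg.qr(B_star, mode="complete")
    B_perp = Q[:, k:]
    sq_sv_par = np.linalg.svd(B_star.T @ B_t, compute_uv=False) ** 2
    sq_sv_perp = np.linalg.svd(B_perp.T @ B_t, compute_uv=False) ** 2
    return sq_sv_par, sq_sv_perp
```

Under the behavior described in the TL;DR, one would expect the entries of `sq_sv_perp` to shrink during training while those of `sq_sv_par` stay bounded away from zero; the sketch only measures these values and does not by itself verify that claim.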

Theorems & Definitions (25)

  • proof (×10)
  • ...and 15 more