Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm

Chelsea Finn, Sergey Levine

TL;DR

This paper asks whether gradient-based meta-learning, implemented via deep representations updated by standard gradient steps, can replicate any learning algorithm. It proves one-shot and K-shot universality results for MAML-style learners, showing that with an expressive enough architecture and a bias-transformation, gradient descent can emulate arbitrary learning procedures. It further identifies loss functions that support this universality (mean-squared error and cross-entropy) and demonstrates empirically that gradient-based meta-learners generalize better to out-of-distribution tasks and benefit from depth. Overall, the work positions gradient-based meta-learning as not only as expressive as recurrent meta-learners but also empirically advantageous in terms of generalization and robustness to overfitting.

Abstract

Learning to learn is a powerful paradigm for enabling models to learn from data more effectively and efficiently. A popular approach to meta-learning is to train a recurrent model to read in a training dataset as input and output the parameters of a learned model, or output predictions for new test inputs. Alternatively, a more recent approach to meta-learning aims to acquire deep representations that can be effectively fine-tuned, via standard gradient descent, to new tasks. In this paper, we consider the meta-learning problem from the perspective of universality, formalizing the notion of learning algorithm approximation and comparing the expressive power of the aforementioned recurrent models to the more recent approaches that embed gradient descent into the meta-learner. In particular, we seek to answer the following question: does deep representation combined with standard gradient descent have sufficient capacity to approximate any learning algorithm? We find that this is indeed true, and further find, in our experiments, that gradient-based meta-learning consistently leads to learning strategies that generalize more widely compared to those represented by recurrent models.
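The fine-tuning step the abstract refers to can be illustrated with a minimal sketch. It assumes a toy linear model, a mean-squared-error loss, and hypothetical names (`adapt`, `theta_init`, `alpha`) that do not come from the paper. It shows only the test-time adaptation by standard gradient descent, not the outer meta-training loop that would produce the initialization.

```python
import numpy as np

def mse_loss(theta, x, y):
    """Mean-squared error of a linear model y_hat = x @ theta."""
    residual = x @ theta - y
    return float(np.mean(residual ** 2))

def mse_grad(theta, x, y):
    """Analytic gradient of mse_loss with respect to theta."""
    residual = x @ theta - y
    return 2.0 * x.T @ residual / len(y)

def adapt(theta, x, y, alpha=0.05, steps=1):
    """Fine-tune theta with `steps` plain gradient-descent updates on (x, y)."""
    for _ in range(steps):
        theta = theta - alpha * mse_grad(theta, x, y)
    return theta

rng = np.random.default_rng(0)
theta_init = np.zeros(3)                 # stand-in for a meta-learned initialization (assumption)
true_theta = np.array([1.0, -2.0, 0.5])  # hypothetical parameters of the new task
x_train = rng.normal(size=(5, 3))        # K = 5 training examples for the new task
y_train = x_train @ true_theta

theta_adapted = adapt(theta_init, x_train, y_train, alpha=0.05, steps=5)
loss_before = mse_loss(theta_init, x_train, y_train)
loss_after = mse_loss(theta_adapted, x_train, y_train)
```

In a gradient-based meta-learner, `theta_init` would itself be trained so that this short adaptation works well across tasks. A recurrent meta-learner would instead consume `(x_train, y_train)` as input and emit predictions or parameters directly.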

Paper Structure

This paper contains 25 sections, 9 theorems, 48 equations, 6 figures.

Key Result

Lemma 4.1

Let us assume that $\overline{e}(\mathbf{y})$ can be chosen to be any linear (but not affine) function of $\mathbf{y}$. Then, we can choose $\theta_\text{ft}$, $\theta_h$, $\{A_i; i>1\}$, $\{B_i; i<N\}$ such that the function can approximate any continuous function of $(\mathbf{x}, \mathbf{y}, \mathbf{x}^\star)$ on compact subsets of $\mathbb{R}^{\dim(\mathbf{y})}$. The assumption with regard to c
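For orientation, "the function" in the lemma is the model's prediction on $\mathbf{x}^\star$ after one gradient step on a single training pair $(\mathbf{x}, \mathbf{y})$. A sketch of that form follows. The symbols $\hat{f}$ (the network), $\alpha$ (step size), $\ell$ (loss), and $\theta$ (the full parameter set) are assumed notation that does not appear in the excerpt above:

$$
\hat{f}(\mathbf{x}^\star; \theta'), \qquad \theta' = \theta - \alpha \, \nabla_\theta \, \ell\big(\mathbf{y}, \hat{f}(\mathbf{x}; \theta)\big)
$$

Read this way, the lemma says that the pre-update parameters can be chosen so that the post-update prediction approximates a target function of $(\mathbf{x}, \mathbf{y}, \mathbf{x}^\star)$.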

Figures (6)

  • Figure 1: A deep fully-connected neural network with N+2 layers and ReLU nonlinearities. With this generic fully connected network, we prove that, with a single step of gradient descent, the model can approximate any function of the dataset and test input.
  • Figure 2: The effect of additional gradient steps at test time when attempting to solve new tasks. The MAML model, trained with $5$ inner gradient steps, can further improve with more steps. All methods are provided with the same data -- 5 examples -- where each gradient step is computed using the same 5 datapoints.
  • Figure 3: Learning performance on out-of-distribution tasks as a function of the task variability. Recurrent meta-learners such as SNAIL and MetaNet acquire learning strategies that are less generalizable than those learned with gradient-based meta-learning.
  • Figure 4: Comparison of finetuning from a MAML-initialized network and a network initialized randomly, trained from scratch. Both methods achieve about the same training accuracy, but MAML also attains good test accuracy, while the network trained from scratch overfits catastrophically to the 20 examples. Interestingly, the MAML-initialized model does not begin to overfit, even though meta-training used 5 steps while the graph shows up to 100.
  • Figure 5: Comparison of depth while keeping the number of parameters constant. Task-conditioned models do not need more than one hidden layer, whereas meta-learning with MAML clearly benefits from additional depth. Error bars show standard deviation over three training runs.
  • ...and 1 more figure
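The Figure 5 caption compares depths at a constant number of parameters but does not say how the widths were chosen. One plausible way to do that bookkeeping is sketched below, assuming a fully connected network with equal-width hidden layers and biases. The helper names and the 1-input, 1-output, width-100 baseline are hypothetical and not taken from the paper.

```python
def mlp_param_count(in_dim, out_dim, width, depth):
    """Weights plus biases of a fully connected net with `depth` equal-width hidden layers."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    first = in_dim * width + width                  # input -> first hidden layer
    hidden = (depth - 1) * (width * width + width)  # hidden -> hidden layers
    last = width * out_dim + out_dim                # last hidden layer -> output
    return first + hidden + last

def width_for_budget(in_dim, out_dim, depth, budget):
    """Largest hidden width whose parameter count does not exceed `budget`."""
    width = 0
    while mlp_param_count(in_dim, out_dim, width + 1, depth) <= budget:
        width += 1
    return width

# Hypothetical baseline: one hidden layer of width 100 on a scalar regression task.
budget = mlp_param_count(in_dim=1, out_dim=1, width=100, depth=1)
widths = {depth: width_for_budget(1, 1, depth, budget) for depth in (1, 2, 3)}
```

Under these assumptions the deeper networks become much narrower, because the hidden-to-hidden weight matrices grow quadratically with width.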

Theorems & Definitions (9)

  • Lemma 4.1
  • Theorem 6.1
  • Theorem 6.2
  • Lemma A.1
  • Lemma A.1
  • Lemma A.1
  • Lemma C.1
  • Theorem E.1
  • Theorem F.1