Reducing Variance in Meta-Learning via Laplace Approximation for Regression Tasks

Alfredo Reichlin, Gustaf Tegnér, Miguel Vasco, Hang Yin, Mårten Björkman, Danica Kragic

TL;DR

This work reduces the variance of the gradient estimate in gradient-based meta-learning by weighting each support point individually by the variance of its posterior over the parameters, estimated with the Laplace approximation.

Abstract

Given a finite set of sample points, meta-learning algorithms aim to learn an optimal adaptation strategy for new, unseen tasks. Often, this data is ambiguous because the same points may belong to several tasks at once. This is particularly the case in meta-regression tasks. In such cases, the estimated adaptation strategy is subject to high variance due to the limited amount of support data for each task, which often leads to sub-optimal generalization performance. In this work, we address the problem of variance reduction in gradient-based meta-learning and formalize the class of problems prone to it, a condition we refer to as task overlap. Specifically, we propose a novel approach that reduces the variance of the gradient estimate by weighting each support point individually by the variance of its posterior over the parameters. To estimate the posterior, we use the Laplace approximation, which allows us to express the variance in terms of the curvature of the loss landscape of our meta-learner. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of variance reduction in meta-learning.
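
The weighting idea in the abstract can be illustrated with a small sketch. Under a Laplace approximation, each support point yields a Gaussian over the parameters whose covariance is the inverse Hessian of that point's loss, and the standard way to combine such Gaussians is a precision-weighted mean, (sum_i H_i)^-1 sum_i H_i theta_i. The code below is a generic illustration of that rule, not necessarily the authors' exact formulation. The 2-parameter linear model, the squared-error loss, the zero prior, and all function names are assumptions made for the example.

```python
# Sketch: combine per-support-point parameter estimates by curvature weighting.
# Assumptions (not from the source): linear model f(x) = theta[0]*x + theta[1],
# squared-error loss, zero prior, and noise-free support data.

def features(x):
    return (x, 1.0)

def per_point_estimate(x, y):
    """Min-norm theta that fits the single point (x, y), plus its loss Hessian."""
    phi = features(x)
    norm2 = phi[0] ** 2 + phi[1] ** 2
    theta = (phi[0] * y / norm2, phi[1] * y / norm2)
    # Hessian of 0.5*(theta.phi - y)^2 w.r.t. theta is the rank-1 matrix phi phi^T.
    hess = ((phi[0] * phi[0], phi[0] * phi[1]),
            (phi[1] * phi[0], phi[1] * phi[1]))
    return theta, hess

def solve_2x2(a, b):
    """Solve a @ x = b for a 2x2 matrix a via Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return ((a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det)

def precision_weighted(estimates):
    """theta = (sum_i H_i)^-1 sum_i H_i theta_i."""
    h_sum = [[0.0, 0.0], [0.0, 0.0]]
    rhs = [0.0, 0.0]
    for theta, hess in estimates:
        for r in range(2):
            for c in range(2):
                h_sum[r][c] += hess[r][c]
                rhs[r] += hess[r][c] * theta[c]
    return solve_2x2(h_sum, rhs)

def plain_average(estimates):
    """Unweighted mean of the per-point estimates, for comparison."""
    n = len(estimates)
    return (sum(t[0] for t, _ in estimates) / n,
            sum(t[1] for t, _ in estimates) / n)

true_theta = (2.0, -1.0)
support = [(x, true_theta[0] * x + true_theta[1]) for x in (0.5, 1.5, 3.0)]
estimates = [per_point_estimate(x, y) for x, y in support]

weighted = precision_weighted(estimates)
averaged = plain_average(estimates)
print("precision-weighted:", weighted)
print("plain average:     ", averaged)
```

Each single point constrains the parameters only along one direction, so its individual estimate is ambiguous. In this toy setting the curvature-weighted combination recovers the generating parameters, while the unweighted average of the same estimates does not.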

Paper Structure

This paper contains 34 sections, 1 theorem, 25 equations, 8 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

The solution to the variance-reduction problem (eq:var_reduce) is given by:

Figures (8)

  • Figure 1: The Problem of Task Overlap in Regression Tasks: On the left: Three support points of a single task are marked out. These points are shared between different tasks which are marked out by the opaque curves. On the right: Each support point induces a distribution in the parameter space. The true function parameters $\theta^*$ lie at the intersection over the possible function values.
  • Figure 2: The space of task parameters adapts to the Hessian. The sum of the logarithm of the loss for 3 support points over different parameters for the sine experiment. Values increase from white to dark blue. The red cross and the red diamond indicate the prior and the posterior; orange points are the single task-adapted parameters. Top row: Results for CAVIA. Bottom row: Results for LAVA, with the covariance for each support point also shown.
  • Figure 3: Evaluation of Computation Time: We evaluate the performance of the different methods (measured in terms of MSE) as a function of the computational time (in seconds) across different support sizes. The results show that, considering the same execution time (x-axis), LAVA outperforms both CAVIA and ANIL, with lower MSE. This result holds across all evaluated support sizes.
  • Figure 4: Estimator's variance. Variance of the task-adapted parameters given the same task but different support data points. Left: For CAVIA. Center: For LAVA. Right: Log variance of the distribution of the adapted parameters during training.
  • Figure 5: ODEs qualitative results. Rollout Trajectories for the dynamical systems' prediction with CAVIA, LAVA and ANIL. We consider the dynamics of systems with random parameters for $5$ initial conditions. LAVA (blue line) is the only model that consistently predicts the evolution of the system (red dotted line).
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 1: Task Overlap
  • Proposition 1: Variance Reduction