Adaptive Gradient-Based Meta-Learning Methods

Mikhail Khodak, Maria-Florina Balcan, Ameet Talwalkar

TL;DR

ARUBA reframes gradient-based meta-learning as online learning of regret upper-bounds, enabling adaptive learning of task similarity and dynamic environment handling through online mirror descent with Bregman divergences. It provides theoretical guarantees for static, dynamic, and statistical learning-to-learn settings, and introduces practical per-coordinate learning-rate mechanisms (ARUBA and variants) that automatically adapt to task structure and geometry. Empirically, ARUBA improves meta-test-time performance on few-shot classification and federated learning benchmarks, while reducing the need for hyperparameter tuning. Overall, ARUBA offers a principled, tunable framework that extends GBML to adaptive, geometry-aware, and communication-efficient settings with strong transfer guarantees.

Abstract

We build a theoretical framework for designing and understanding practical meta-learning methods that integrates sophisticated formalizations of task-similarity with the extensive literature on online convex optimization and sequential prediction algorithms. Our approach enables the task-similarity to be learned adaptively, provides sharper transfer-risk bounds in the setting of statistical learning-to-learn, and leads to straightforward derivations of average-case regret bounds for efficient algorithms in settings where the task-environment changes dynamically or the tasks share a certain geometric structure. We use our theory to modify several popular meta-learning algorithms and improve their meta-test-time performance on standard problems in few-shot learning and federated learning.

Paper Structure

This paper contains 32 sections, 42 theorems, 124 equations, 7 figures, 1 table, 3 algorithms.

Key Result

Theorem 3.1

Assume $\Theta\subset\mathbb R^d$ is convex, each task $t\in[T]$ is a sequence of $m_t$ convex losses $\ell_{t,i}:\Theta\mapsto\mathbb R$ with mean squared Lipschitz constant $G_t^2$, and $R:\Theta\mapsto\mathbb R$ is 1-strongly convex. If Algorithm \ref{alg:general} sets $\phi_t=\operatorname{INIT}(t)$ and $\eta_t=\frac{\operatorname{SIM}(t)}{G_t\sqrt{m_t}}$, then for $V_\Psi^2=\frac{\sum_{t=1}^T\mathc
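
The step-size rule in the theorem statement, $\eta_t=\operatorname{SIM}(t)/(G_t\sqrt{m_t})$ with initialization $\phi_t=\operatorname{INIT}(t)$, can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the regularizer is $R(\theta)=\frac12\|\theta\|_2^2$ with $\Theta=\mathbb R^d$, so online mirror descent reduces to plain online gradient descent; `sim` is a fixed user-supplied guess standing in for $\operatorname{SIM}(t)$; and the initialization is a hypothetical running mean of earlier tasks' last iterates. The function names `step_size`, `run_task`, and `meta_learn` are illustrative only.

```python
import math


def step_size(sim, grad_bound, num_losses):
    """Within-task learning rate eta_t = SIM(t) / (G_t * sqrt(m_t))."""
    return sim / (grad_bound * math.sqrt(num_losses))


def run_task(phi, eta, grad_fns):
    """Online gradient descent on one task, starting from phi.

    Assumes R(theta) = 0.5 * ||theta||^2 and an unconstrained domain,
    so no projection step is needed.
    """
    theta = list(phi)
    for grad in grad_fns:
        g = grad(theta)
        theta = [x - eta * gi for x, gi in zip(theta, g)]
    return theta


def meta_learn(tasks, phi0, sim, grad_bound):
    """Run tasks in sequence, updating the initialization between tasks.

    tasks: list of tasks, each a list of gradient functions (one per loss).
    The initialization rule below (running mean of last iterates) is a
    hypothetical stand-in for INIT(t), chosen only for illustration.
    """
    phi = list(phi0)
    finals = []
    for t, grad_fns in enumerate(tasks, start=1):
        eta = step_size(sim, grad_bound, len(grad_fns))
        theta = run_task(phi, eta, grad_fns)
        finals.append(theta)
        # Incremental running-mean update of the initialization.
        phi = [p + (th - p) / t for p, th in zip(phi, theta)]
    return phi, finals


def quadratic_grad(center):
    """Gradient of the toy loss 0.5 * ||theta - center||^2."""
    return lambda theta: [x - c for x, c in zip(theta, center)]


if __name__ == "__main__":
    # Two toy tasks with nearby optima, four losses each.
    toy_tasks = [[quadratic_grad([1.0])] * 4, [quadratic_grad([1.2])] * 4]
    phi, finals = meta_learn(toy_tasks, phi0=[0.0], sim=1.0, grad_bound=2.0)
    print(phi, finals)
```

In this sketch a larger `sim` (tasks believed to be far apart) yields a larger within-task step, while a smaller `sim` keeps each task's iterates close to the shared initialization.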

Figures (7)

  • Figure 1: Left - Theorem \ref{thm:similarity} improves upon \cite{khodak:19} via its dependence on the average deviation $V$ rather than the maximal deviation $D^\ast$ of the optimal task-parameters $\theta_t^\ast$ (light blue). Right - a case where Theorem \ref{thm:dynamic} yields a strong task-similarity-based guarantee via a dynamic comparator $\Psi$ despite the deviation $V$ being large.
  • Figure 2: Learning rate variation across layers of a convolutional net trained on Mini-ImageNet using Algorithm \ref{alg:aruba}. Following intuition outlined in Section \ref{sec:empirical}, shared feature extractors are not updated much if at all compared to higher layers.
  • Figure 3: ARUBA: an approach for modifying a generic batch GBML method to learn a per-coordinate learning rate. Two specialized variants provided below.
  • Figure 4: Final learning rate $\eta_T$ across the layers of a convolutional network trained on 1-shot 5-way Omniglot (top) and 5-shot 5-way Omniglot (bottom) using Algorithm \ref{alg:aruba} applied to Reptile.
  • Figure 5: Final learning rate $\eta_T$ across the layers of a convolutional network trained on 1-shot 20-way Omniglot (top) and 5-shot 20-way Omniglot (bottom) using Algorithm \ref{alg:aruba} applied to Reptile.
  • ...and 2 more figures

Theorems & Definitions (98)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • Remark 3.1
  • Theorem 3.3
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 5.1
  • Corollary 5.1
  • Definition A.1
  • ...and 88 more