Universal Algorithm-Implicit Learning

Stefano Woerner; Seong Joon Oh; Christian F. Baumgartner

TL;DR

The paper introduces a theoretical framework for meta-learning that formally defines practical universality and distinguishes algorithm-explicit from algorithm-implicit learning, giving a principled vocabulary for reasoning about universal meta-learning methods.

Abstract

Current meta-learning methods are constrained to narrow task distributions with fixed feature and label spaces, limiting applicability. Moreover, the current meta-learning literature uses key terms like "universal" and "general-purpose" inconsistently and lacks precise definitions, hindering comparability. We introduce a theoretical framework for meta-learning which formally defines practical universality and introduces a distinction between algorithm-explicit and algorithm-implicit learning, providing a principled vocabulary for reasoning about universal meta-learning methods. Guided by this framework, we present TAIL, a transformer-based algorithm-implicit meta-learner that functions across tasks with varying domains, modalities, and label configurations. TAIL features three innovations over prior transformer-based meta-learners: random projections for cross-modal feature encoding, random injection label embeddings that extrapolate to larger label spaces, and efficient inline query processing. TAIL achieves state-of-the-art performance on standard few-shot benchmarks while generalizing to unseen domains. Unlike other meta-learning methods, it also generalizes to unseen modalities, solving text classification tasks despite training exclusively on images, handles tasks with up to 20$\times$ more classes than seen during training, and provides orders-of-magnitude computational savings over prior transformer-based approaches.

Paper Structure (34 sections, 7 theorems, 14 equations, 4 figures, 8 tables)

Introduction
Background and Notation
The Learning Problem
Meta-Learning
Universal Algorithm-Implicit Learning
Demonstration-Conditioned Inference
Algorithm-Explicit vs. Algorithm-Implicit Learning
Practical Universality and Universal Learning Algorithms
Few-Shot Benchmarking
A Transformer-based Universal Algorithm-Implicit Learner
Universal Feature Encoding
Random Injection Label Embedding and Classification Head
Training procedure
Theoretical properties
Related Work
...and 19 more sections

Key Result

Theorem 1.1

Let $\mathcal{X}$ be a feature space, $\mathcal{Y}$ a label space, $S = \{(x_i, y_i)\}_{i=1}^{n}$ a support dataset, and $(x, y)$ a query sample. For any permutation $\sigma$ of $\mathcal{Y}$, let $$S^{\sigma} = \{(x_i, \sigma(y_i))\}_{i=1}^{n}.$$ Then $$\Pr\left[g_\theta(S^{\sigma}, x) = \sigma(y)\right] = \Pr\left[g_\theta(S, x) = y\right],$$ i.e. $g_\theta(S, x)$ is equivariant in distribution to the reindexing of $\mathcal{Y}$.
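
The following is a minimal sketch of what equivariance in distribution means in practice. It is not the paper's model. It assumes a toy predictor, `base_model` (a hypothetical name), that works on dictionary slots and breaks ties toward the smallest slot, so on its own it favours some label indices. Labels are mapped to slots by a uniformly random injection, in the spirit of the random injection label embedding. The output distribution is computed exactly by enumerating every injection, and the check confirms that relabeling the support set with a permutation `sigma` permutes the prediction probabilities in the same way.

```python
from collections import Counter
from itertools import permutations


def base_model(support, x):
    """Toy deterministic predictor on dictionary slots (hypothetical stand-in).

    Returns the slot of the nearest support point; ties go to the smallest
    slot id, so without randomisation the predictor is index-biased.
    """
    return min(support, key=lambda p: (abs(p[0] - x), p[1]))[1]


def prediction_distribution(support, x, labels, dict_size):
    """Exact distribution of the predicted label when task labels are mapped
    to dictionary slots by a uniformly random injection."""
    counts = Counter()
    # Every ordered selection of len(labels) distinct slots is one injection.
    for slots in permutations(range(dict_size), len(labels)):
        inject = dict(zip(labels, slots))                   # label -> slot
        recover = {slot: label for label, slot in inject.items()}
        embedded = [(xi, inject[yi]) for xi, yi in support]
        counts[recover[base_model(embedded, x)]] += 1
    total = sum(counts.values())
    return {label: counts[label] / total for label in labels}


labels = ["a", "b", "c"]
support = [(0.0, "a"), (2.0, "b"), (5.0, "c")]
sigma = {"a": "b", "b": "c", "c": "a"}                      # a re-indexing of the labels
relabeled = [(xi, sigma[yi]) for xi, yi in support]

p = prediction_distribution(support, 1.0, labels, dict_size=4)
p_sigma = prediction_distribution(relabeled, 1.0, labels, dict_size=4)

# Equivariance in distribution: the probability of sigma(y) after relabeling
# equals the probability of y before relabeling, for every label y.
equivariant = all(p_sigma[sigma[y]] == p[y] for y in labels)
print(p, p_sigma, equivariant)
```

The query at `1.0` is equidistant from the points labelled `a` and `b`, so the tie-break alone would always favour whichever label has the lower index. Averaging over random injections removes that dependence on the indexing.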

Figures (4)

Figure 1: Method overview. The input is encoded with a modality-appropriate pretrained encoder and then projected to a common modality-agnostic space. The labels are embedded using a randomized injection to a learnable embedding dictionary. The input and label embeddings are concatenated and form the input tokens for a transformer encoder. A linear classification head makes a prediction in label embedding space, which is then remapped to the original set of labels.
Figure 2: (a) Performance degradation with increasing number of classes (1-shot setting). (b) and (c) Wall-clock time for 1000 test episodes as a function of task size, shown at two different scales for comparison with the algorithm-explicit baselines and with the meta-learning baselines. (d) Memory usage during training as a function of task size. (e) Wall-clock time for 1000 training episodes.
Figure 3: Performance degradation with increasing number of classes (5-shot setting).
Figure 4: Validation loss curves for scheduled addition of more embeddings to the embedding dictionary.
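
The token construction described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation. Random vectors stand in for the pretrained encoder outputs, a Gaussian matrix resampled per call stands in for the projection (the source does not say whether it is fixed or resampled), a random table stands in for the learnable embedding dictionary, and the transformer and classification head are omitted. All function names are hypothetical.

```python
import random


def random_projection(features, out_dim, rng):
    """Project vectors of any input dimension to out_dim using a Gaussian
    random matrix scaled by 1/sqrt(out_dim)."""
    in_dim = len(features[0])
    scale = out_dim ** -0.5
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [[sum(w * v for w, v in zip(row, vec)) for row in matrix]
            for vec in features]


def random_injection(labels, dict_size, rng):
    """Assign each distinct task label a distinct, randomly chosen slot of
    the embedding dictionary (requires at least as many slots as classes)."""
    classes = sorted(set(labels))
    slots = rng.sample(range(dict_size), len(classes))
    return dict(zip(classes, slots))


def build_tokens(projected, labels, inject, dictionary):
    """Concatenate each projected input with the embedding of its label."""
    return [x + dictionary[inject[y]] for x, y in zip(projected, labels)]


rng = random.Random(0)
dict_size, label_dim, common_dim = 10, 3, 4
# Stand-in for the learnable embedding dictionary.
dictionary = [[rng.gauss(0.0, 1.0) for _ in range(label_dim)]
              for _ in range(dict_size)]

# Stand-ins for encoder outputs from two modalities with different widths.
image_feats = [[rng.random() for _ in range(8)] for _ in range(6)]
text_feats = [[rng.random() for _ in range(5)] for _ in range(6)]
labels = ["cat", "dog", "cat", "bird", "dog", "bird"]

inject = random_injection(labels, dict_size, rng)
recover = {slot: label for label, slot in inject.items()}   # remap predictions

image_tokens = build_tokens(
    random_projection(image_feats, common_dim, rng), labels, inject, dictionary)
text_tokens = build_tokens(
    random_projection(text_feats, common_dim, rng), labels, inject, dictionary)
print(len(image_tokens[0]), len(text_tokens[0]), inject)
```

Both modalities end up with tokens of the same width (`common_dim + label_dim`), which is what lets a single sequence model consume them. A prediction made in dictionary-slot space would be mapped back to the task's own labels through `recover`.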

Theorems & Definitions (20)

Definition 2.1: Learning Algorithm
Definition 3.1: Demonstration-Conditioned Inference
Definition 3.2: Universal Consistency
Definition 3.3: Learning Curve
Definition 3.4: Valid Learning Algorithm
Definition 3.5: Universal Validity
Theorem 1.1: Equivariance to label re-indexing
Proof: Proof of Theorem 1.1
Proposition 1.2: Unbiased gradients
Proposition 1.3: Coverage over $t$ episodes
...and 10 more
