Context-Aware Meta-Learning

Christopher Fifty, Dennis Duan, Ronald G. Junkins, Ehsan Amid, Jure Leskovec, Christopher Re, Sebastian Thrun

TL;DR

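This work proposes a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. It leverages a frozen pre-trained feature extractor and, analogous to in-context learning, recasts visual meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label.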

Abstract

Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts visual meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks. Our code is available at https://github.com/cfifty/CAML.
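
The sequence-modeling framing described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for the outputs of the frozen feature extractor, one-hot vectors stand in for the label embeddings, the all-zero vector used for the query's unknown label is an assumption, and the function name `build_sequence` is hypothetical.

```python
import numpy as np

def build_sequence(support_feats, support_labels, query_feat, num_classes):
    """Concatenate each feature vector with a label embedding.

    Assumptions for illustration only: one-hot label embeddings for the
    support set and an all-zero placeholder for the query's unknown label.
    """
    label_emb = np.eye(num_classes)[support_labels]        # (n_support, num_classes)
    support_tokens = np.concatenate([support_feats, label_emb], axis=1)
    unknown = np.zeros(num_classes)                         # placeholder for "label unknown"
    query_token = np.concatenate([query_feat, unknown])[None, :]
    # Placing the query first is an arbitrary choice made for this sketch.
    return np.concatenate([query_token, support_tokens], axis=0)

rng = np.random.default_rng(0)
feat_dim, num_classes = 8, 5
support_feats = rng.normal(size=(5, feat_dim))   # stand-in for frozen encoder output
support_labels = np.arange(5)                    # 5-way-1-shot: one image per class
query_feat = rng.normal(size=feat_dim)

seq = build_sequence(support_feats, support_labels, query_feat, num_classes)
print(seq.shape)  # (6, 13)
```

A sequence model would then consume `seq` and the output at the query position would be used to predict the query's class; that model is omitted here.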

Paper Structure (23 sections, 8 theorems, 8 equations, 5 figures, 12 tables)

Key Result

Theorem 1

The set of class embeddings $\{\phi_j\}_{j=1}^d$ that maximizes $p_{\psi_j}(X=j)$ for all $j$, $1 \leq j \leq d$, is necessarily an ELMES.
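
To make the object in Theorem 1 more concrete, the sketch below builds a set of $d$ equal-length vectors whose pairwise angles are all equal and as large as possible. It assumes that an ELMES corresponds to the vertices of a centered regular simplex, which is consistent with the $d=4$ tetrahedron described in Figure 3; the paper's formal definition should be treated as authoritative. The function name `simplex_embeddings` is hypothetical, and the vectors are expressed in $\mathbb{R}^d$ (spanning a $(d-1)$-dimensional subspace) rather than in explicit $\mathbb{R}^{d-1}$ coordinates.

```python
import numpy as np

def simplex_embeddings(d):
    """Return d unit vectors (rows) with pairwise inner product -1/(d-1).

    Construction: center the standard basis vectors e_j by subtracting
    their mean, then rescale each row to unit length.
    """
    centered = np.eye(d) - np.full((d, d), 1.0 / d)
    return centered * np.sqrt(d / (d - 1))

d = 4
phi = simplex_embeddings(d)
gram = phi @ phi.T                              # pairwise inner products
norms = np.linalg.norm(phi, axis=1)             # equal length
off_diag = gram[~np.eye(d, dtype=bool)]         # equal pairwise angles

print(np.allclose(norms, 1.0))                  # True
print(np.allclose(off_diag, -1.0 / (d - 1)))    # True
```

For $d=4$ the common inner product is $-1/3$, the angle between vertices of a regular tetrahedron as seen from its center.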

Figures (5)

  • Figure 1: Overview of CAML. Query and support set images are encoded with a pre-trained feature extractor and then concatenated with their corresponding ELMES label embeddings. We feed the resulting sequence of concatenated vectors into a non-causal sequence model and extract the query vector from the output sequence to predict its class.
  • Figure 2: Two sample tasks over the same support images but utilizing different criteria to define classes. The nature of the query image informs the task being presented, e.g. classification by object (top) vs. classification by texture (bottom). For both tasks, the output of the non-causal sequence model provides better separation among class representations than CLIP embeddings and groups the query representation with the proper task, even when projected into 2D space by PCA.
  • Figure 3: A visualization of a $d=4$ ELMES in $\mathbb{R}^3$. Observe the endpoints of the vectors of an ELMES lie on the vertices of a centered regular tetrahedron.
  • Figure 4: t-SNE projections of different image embeddings of various benchmark datasets, with embeddings colored by class identity. We see ViT-huge trained with Laion-2b better separates the Aircraft dataset than does ViT-base trained with CLIP. However, both image encoders are unable to separate ChestX.
  • Figure 5: (Left) histogram of the correct class probability for the example presented in \ref{fig:analysis1} after permuting the assignment of labels to support-set images for all 120 permutations of the 5-way-1-shot task. (Right) histogram of the average standard deviation of all 120 permutations of the 5-way-1-shot task for 1000 samples from mini-ImageNet.
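
The count of 120 in Figure 5 is $5!$, the number of ways to assign five labels to five support images in a 5-way-1-shot task. The sketch below enumerates those assignments and measures how much the correct-class probability varies across them. The predictor here is a hypothetical stand-in (a softmax over cosine similarities to the support images), not the model from the paper, and the features are random placeholders.

```python
import itertools
import math
import numpy as np

def correct_class_prob(support_feats, support_labels, query_feat, true_class):
    """Stand-in predictor: softmax over cosine similarity to each support image."""
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    sims = s @ q
    logits = np.empty(len(support_labels))
    logits[support_labels] = sims            # logit for class k comes from its support image
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[true_class]

rng = np.random.default_rng(0)
n_way = 5
support_feats = rng.normal(size=(n_way, 16))               # placeholder features
query_feat = support_feats[2] + 0.1 * rng.normal(size=16)  # query resembles support image 2

perms = list(itertools.permutations(range(n_way)))
probs = []
for perm in perms:
    labels = np.array(perm)                  # support image i is assigned label perm[i]
    probs.append(correct_class_prob(support_feats, labels, query_feat, labels[2]))

print(len(perms), math.factorial(n_way))     # 120 120
print(np.std(probs))                         # spread of the correct-class probability
```

This stand-in is invariant to relabeling by construction, so its standard deviation is zero up to floating-point error. A learned sequence model carries no such guarantee, and the spread across the 120 assignments is the quantity the histograms summarize.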

Theorems & Definitions (33)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Remark 1
  • ...and 23 more