
MAML-en-LLM: Model Agnostic Meta-Training of LLMs for Improved In-Context Learning

Sanchit Sinha, Yuguang Yue, Victor Soto, Mayank Kulkarni, Jianhua Lu, Aidong Zhang

TL;DR

This paper introduces MAML-en-LLM, a bi-level meta-learning framework that meta-trains LLMs via inner-task adaptation across multiple tasks and outer meta-updates with second-order gradients, aiming to produce truly generalizable parameters for in-context learning. By sharing optimizer moments between inner and outer updates, it stabilizes the dual optimization and enables consolidated meta-training, yielding improved generalization to unseen domains and better adaptation with limited data. Across two diverse datasets (CrossFit and UnifiedQA) and two model variants (standard and channel), MAML-en-LLM outperforms the state-of-the-art MetaICL in a majority of settings and demonstrates strong few-shot adaptation. The work highlights the impact of task complexity, the number of exploration tasks, and optimizer choice on performance, motivating broader adoption of classical meta-learning techniques for LLM meta-training and in-context learning.

Abstract

Adapting large language models (LLMs) to unseen tasks with in-context training samples, without fine-tuning, remains an important research problem. To learn a robust LLM that adapts well to unseen tasks, multiple meta-training approaches have been proposed, such as MetaICL and MetaICT, which meta-train pre-trained LLMs on a wide variety of diverse tasks. These approaches essentially perform in-context multi-task fine-tuning and evaluate on a disjoint test set of tasks. Even though they achieve impressive performance, their goal is never to compute a truly general set of parameters. In this paper, we propose MAML-en-LLM, a novel method for meta-training LLMs that can learn truly generalizable parameters, which not only perform well on disjoint tasks but also adapt to unseen tasks. We see an average performance increase of 2% on unseen domains and a 4% improvement in adaptation performance. Furthermore, we demonstrate that MAML-en-LLM outperforms baselines by an average of 2% in settings with a limited amount of training data, on both seen and unseen domains. Finally, we discuss the effects of task type, optimizer, and task complexity, an avenue barely explored in the meta-training literature. Exhaustive experiments across 7 task settings and two data settings demonstrate that models trained with MAML-en-LLM outperform SOTA meta-training approaches.

Paper Structure (33 sections, 10 equations, 3 figures, 6 tables, 1 algorithm)

Figures (3)

  • Figure 1: Visual comparison between MetaICL and MAML-en-LLM. The figure shows a single update of the model parameters, from $\theta_i$ at step $i$ to $\theta_{i+1}$ at step $i+1$. The dotted lines represent the adaptation phase and the solid lines represent the update. As MetaICL does not have an explicit adaptation phase, the update happens directly, and only a limited region of the parameter space is explored. The parameter update for a single step is calculated using only a single task. MAML-en-LLM, on the other hand, first explores a wide region of the parameter space using multiple adapted parameters and then performs the final meta-update with the second-order gradients calculated from the intermediate adapted parameters.
  • Figure 2: Schematic figure demonstrating MAML training for meta-training LLMs. MAML is a bi-level optimization framework with an inner update (adaptation) and an outer update (meta-update). In the figure, the green cells represent the input samples (prompts), the yellow cells represent task labels, blue components represent functions and purple boxes represent model parameters. Multiple task batches are utilized to compute a set of adapted parameters (equal to the number of tasks) represented by $\theta_{i}$. The outer update utilizes the adapted parameters to compute second-order gradients ($\Delta_{i}^{\theta_i}$) using a separate set of task batches. The final meta-update updates the unadapted parameter $\theta$ with an average of second-order gradients ($\Delta_{i}^{\theta_i}$) to compute the next set of updated parameters.
  • Figure 3: Example training and test prompts for Standard (Left) and Channel (Right) Models. During training, Channel models learn to predict the sample conditioned on its true label. During inference, Channel models predict the target sample itself conditioned on each possible label for the task. In this example, the task is Sentiment Analysis and hence has only two labels, Positive and Negative.
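
The bi-level update described in the Figure 2 caption can be illustrated with a minimal sketch on a toy problem. This is not the authors' implementation. It assumes a scalar parameter, hypothetical quadratic task losses (`QuadraticTask`), plain gradient descent for both loops (the shared optimizer moments mentioned in the TL;DR are omitted), and it reuses the same loss for the adaptation and meta-update batches. For a quadratic loss the second-order term is available in closed form, which makes the role of the Hessian explicit.

```python
from dataclasses import dataclass


@dataclass
class QuadraticTask:
    """Hypothetical toy task with loss 0.5 * curvature * (theta - optimum)**2."""
    curvature: float
    optimum: float

    def grad(self, theta: float) -> float:
        # First derivative of the task loss with respect to theta.
        return self.curvature * (theta - self.optimum)

    def hessian(self, theta: float) -> float:
        # Second derivative; constant for a quadratic loss.
        return self.curvature


def maml_step(theta: float, tasks: list, inner_lr: float, outer_lr: float) -> float:
    """One meta-update: adapt per task, then average the meta-gradients."""
    meta_grads = []
    for task in tasks:
        # Inner update (adaptation): one gradient step from the shared theta.
        adapted = theta - inner_lr * task.grad(theta)
        # Derivative of the adapted parameter with respect to theta.
        # The Hessian appears here; this is the second-order term that a
        # first-order approximation would replace with 1.0.
        d_adapted_d_theta = 1.0 - inner_lr * task.hessian(theta)
        # Chain rule: gradient of the post-adaptation loss w.r.t. theta.
        meta_grads.append(d_adapted_d_theta * task.grad(adapted))
    # Outer update (meta-update): move the unadapted theta along the mean.
    mean_meta_grad = sum(meta_grads) / len(meta_grads)
    return theta - outer_lr * mean_meta_grad


if __name__ == "__main__":
    tasks = [QuadraticTask(1.0, 2.0), QuadraticTask(1.0, -2.0)]
    theta = 1.0
    for _ in range(5):
        theta = maml_step(theta, tasks, inner_lr=0.5, outer_lr=1.0)
        print(theta)
```

With two tasks whose optima sit symmetrically around zero, the meta-update pulls `theta` toward the point from which a single inner step does well on both tasks, rather than toward either task's own optimum. In a real LLM setting the closed-form derivative would be replaced by automatic differentiation through the inner step.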