Mixtures of In-Context Learners

Giwon Hong, Emile van Krieken, Edoardo Ponti, Nikolay Malkin, Pasquale Minervini

TL;DR

Mixtures of In-Context Learners (MoICL) treats subsets of demonstrations as experts and learns a weighting function, fit on a training set, that merges their output distributions.

Abstract

In-context learning (ICL) adapts LLMs by providing demonstrations without fine-tuning the model parameters; however, it does not differentiate between demonstrations, and its cost in Transformer LLMs grows quadratically with the length of the context, exhausting memory. As a solution, we propose Mixtures of In-Context Learners (MoICL), a novel approach that treats subsets of demonstrations as experts and learns a weighting function to merge their output distributions based on a training set. In our experiments, we show performance improvements on 5 out of 7 classification datasets compared to a set of strong baselines (up to +13% compared to ICL and LENS). Moreover, we enhance the Pareto frontier of ICL by reducing the inference time needed to achieve the same performance with fewer demonstrations. Finally, MoICL is more robust to out-of-domain (up to +11%), imbalanced (up to +49%), or noisy demonstrations (up to +38%), or can filter these out from datasets. Overall, MoICL is a more expressive approach to learning from demonstrations without exhausting the context window or memory.
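The weighting idea described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text above: one scalar weight per expert, mixing done as a weighted sum of the experts' log-probabilities followed by renormalisation, and plain gradient descent on the negative log-likelihood. The helper names (`partition`, `mix_log_probs`, `fit_weights`, `mean_nll`) are hypothetical, and the per-expert log-probabilities are synthetic stand-ins for what an LLM prompted with each demonstration subset would return.

```python
import numpy as np

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def partition(demos, k):
    # Split a list of demonstrations into k strided subsets, one per expert.
    return [demos[i::k] for i in range(k)]

def mix_log_probs(expert_log_probs, weights):
    # expert_log_probs: (k, C) class log-probabilities from each expert.
    # weights: (k,) scalar weights. Returns (C,) normalised log-probabilities.
    return log_softmax(weights @ expert_log_probs)

def mean_nll(weights, expert_log_probs, labels):
    # expert_log_probs: (N, k, C); labels: (N,) integer class ids.
    mixed = log_softmax(np.einsum("k,nkc->nc", weights, expert_log_probs))
    return -mixed[np.arange(len(labels)), labels].mean()

def fit_weights(expert_log_probs, labels, steps=200, lr=0.5):
    # Gradient descent on the mean negative log-likelihood w.r.t. the weights.
    n, k, c = expert_log_probs.shape
    w = np.full(k, 1.0 / k)              # assumed uniform initialisation
    onehot = np.eye(c)[labels]
    for _ in range(steps):
        mixed = log_softmax(np.einsum("k,nkc->nc", w, expert_log_probs))
        # d(NLL)/dw_k = mean_n sum_c (p_nc - y_nc) * log p_k(c | x_n)
        grad = np.einsum("nc,nkc->k", np.exp(mixed) - onehot, expert_log_probs) / n
        w -= lr * grad
    return w

# Synthetic stand-in for LLM outputs: expert 0 is informative, expert 1 is noise.
rng = np.random.default_rng(0)
n, c = 200, 2
labels = rng.integers(0, c, size=n)
good = np.log(np.where(np.eye(c)[labels] == 1, 0.8, 0.2))   # (n, c)
noisy = log_softmax(rng.normal(size=(n, c)))                # (n, c)
expert_log_probs = np.stack([good, noisy], axis=1)          # (n, 2, c)

nll_before = mean_nll(np.full(2, 0.5), expert_log_probs, labels)
w = fit_weights(expert_log_probs, labels)
nll_after = mean_nll(w, expert_log_probs, labels)
print("weights:", w, "NLL before/after:", nll_before, nll_after)
```

In this toy setup the learned weight of the informative expert grows while the weight of the noisy expert shrinks, which is the kind of behaviour the abstract attributes to MoICL under noisy demonstrations. Because each expert sees only its own subset, no single prompt has to hold all demonstrations at once.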

Paper Structure

This paper contains 40 sections, 5 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: A Mixture of In-Context Learners (MoICL) first partitions a set of demonstrations $D$ into $k$ partitions to create $k$ experts trained via in-context learning, and then combines their next-token predictions via a trainable weighting function.
  • Figure 2: Accuracy as a function of the number of demonstrations per subset on the TweetEval offensive dataset. The shaded area represents the standard deviation. We also compare mixing logits to mixing probabilities; see the appendix on mixing.
  • Figure 3: Visualisation of the tuned weights when (a) 50% and (b) 70% of demonstrations are OOD. The y-axis indicates the weights, whereas the x-axis represents the index of demonstrations sorted in ascending order (across five different seeds). Blue bars correspond to in-domain (ID) demonstrations, and red bars correspond to out-of-domain (OOD) demonstrations.
  • Figure 4: Resilience of ICL to added noisy demonstrations. We report EM as a function of the number of noisy demonstrations out of the 12 total demonstrations in NQ. For scalar weights, we also present the average weights of standard and noisy demonstrations as (standard, noisy).
  • Figure 5: An analysis of MoICL's data efficiency on the TweetEval offensive/hate test set using Llama-3-8B-Instruct. Concat-based ICL concatenated all available demonstrations (x-axis), though more than 160 exceeded the context length. MoICL Scalar Weights ($k=n$) assigned the designated demonstrations to the experts while using the remaining available demonstrations for fine-tuning.
  • ...and 1 more figure