GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks

Shivanshu Gupta, Clemens Rosenbaum, Ethan R. Elenberg

TL;DR

This work tackles the sensitivity of in-context learning (ICL) to example selection by introducing Example Gisting, which trains gist encoders with an attention bottleneck that compresses the salient information in an input into a small number of gist tokens. The resulting GistScore metric ranks candidate in-context examples by their relevance to the test input. Two training regimes are studied: dataset-specific fine-tuning (GS[F]) and a single multi-task model (GS[M]) that can be applied to new tasks without further training. Across 21 datasets and 8 LLMs, GS[F] achieves state-of-the-art ICL performance (gains of up to about 21 points over SBERT/BM25 and about 5 points over prior trained methods), while GS[M] generalizes well to held-out tasks and prompt templates and runs orders of magnitude faster than BSR. The set-selection extension, Set-GS, further improves performance on compositional tasks such as semantic parsing, and the multi-task model offers a practical, scalable ICL pipeline that can replace traditional retrieval baselines in many settings.

Abstract

In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts comprising a few task examples. However, ICL performance can be critically sensitive to the choice of examples. To dynamically select the best examples for every test input, we propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning with an attention bottleneck between the inputs and outputs. These gist models form the basis for GistScore, a novel metric for scoring and selecting informative examples. Further, we experiment with two variations: (1) fine-tuning gist models for each dataset and (2) multi-task training a single model on a large collection of datasets. The latter can be used for new tasks out-of-the-box, enabling a training-free ICL pipeline. Evaluations with 21 datasets spanning 9 tasks and 8 diverse LLMs show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers and 5% over the best prior methods. Further, our multi-task model generalizes well to new tasks, datasets, and prompt templates. Selection using this model matches or outperforms prior methods while being three orders of magnitude faster than the strongest training-free baseline.
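The attention bottleneck between inputs and outputs described above can be illustrated with a small mask-construction sketch. This is not the authors' implementation: it assumes a single causal sequence laid out as instruction, input, gist, and output tokens, and it assumes output tokens may also attend to earlier output tokens. The function name `gist_attention_mask` and its parameters are hypothetical.

```python
def gist_attention_mask(n_instr, n_input, n_gist, n_output):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): [instruction][input][gist][output].
    """
    gist_start = n_instr + n_input
    out_start = gist_start + n_gist
    total = out_start + n_output

    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i  # assumption: standard left-to-right attention
            if i >= out_start:
                # Bottleneck: output positions cannot see the instruction or
                # the input, only the gist tokens (and earlier output tokens).
                row.append(causal and j >= gist_start)
            else:
                # Instruction, input, and gist positions: plain causal
                # attention, so gist tokens see the instruction and the input.
                row.append(causal)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 1 instruction token, 2 input tokens, 1 gist token, 2 output tokens.
    for row in gist_attention_mask(1, 2, 1, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because every path from the input to the output passes through the gist positions, whatever the output needs must be encoded there, which is what makes the gist activations usable as compact example representations.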

Paper Structure (23 sections, 9 equations, 13 figures, 16 tables)

Figures (13)

  • Figure 1: (Top) Example Gisting involves supervised training with an attention-masking bottleneck. Here, gist tokens (red) may attend to the example inputs (black) and the optional task instruction (yellow), whereas the output (blue) may attend only to the gist tokens. Training with such a bottleneck encourages concise, task-specific encodings of salient aspects of inputs. Further, with multi-task training, the model can be applied to new tasks out-of-the-box. (Bottom) Retrieval of the candidate examples with the highest GistScore with respect to the test input (task instruction omitted for brevity).
  • Figure 2: Single-token GistScore vs. BSR and trained baselines: EPR and CEIL with GPT-Neo-2.7B (Top) and LLM-R with LLaMA-7B (Bottom). All numbers are absolute gains in 8-shot ICL performance over SBERT, except EPR and CEIL on MNLI, SST5, MRPC, and CSQA, which use 50 in-context examples. Both GS[F] and GS[M] consistently outperform all baselines, with GS[F] performing best. Semantic parsing is an exception, as it requires additional gist tokens and set-selection (see Table \ref{tab:semparse}).
  • Figure 3: Comparison of training-free methods on held-out datasets. GS[M] is able to generalize out-of-the-box to held-out datasets, domains (e.g., tweet, medical), and languages, significantly outperforming both off-the-shelf retrievers as well as the stronger but slower BSR.
  • Figure 4: (Left) GS[F] and GS[M] consistently outperform baselines across varying numbers of in-context examples, requiring just 2 examples to surpass 8-shot ICL using SBERT and BM25. (Right) Due to their complex compositional nature, semantic parsing datasets benefit from additional gist tokens and set-selection. With 15 tokens, Set-GS[M] matches the average 8-shot semantic parsing ICL performance of Set-BSR, while Set-GS[F] vastly outperforms it. See Table \ref{tab:semparse} for trained baselines and Table \ref{tab:set-all} for complete results.
  • Figure 5: Example selection using GistScore (GS[M, 1]) is up to four orders of magnitude faster than BSR and up to three orders of magnitude faster than BM25, and scales well with the number of gist tokens.
  • ...and 8 more figures
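
The retrieval step in the bottom panel of Figure 1 can be sketched as a nearest-neighbor search over gist embeddings. This is a minimal sketch under stated assumptions, not the paper's definition of GistScore: it assumes a single gist token per example (so each example is one vector) and uses cosine similarity as a stand-in score, since this excerpt does not give the formula. The names `cosine` and `select_examples` are hypothetical, and the vectors are toy values rather than real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_examples(test_vec, candidate_vecs, k):
    """Return the indices of the k candidates most similar to the test input.

    Assumption: similarity of single-token gist embeddings stands in for
    the score used to rank candidates.
    """
    ranked = sorted(
        range(len(candidate_vecs)),
        key=lambda i: cosine(test_vec, candidate_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    test_vec = [1.0, 0.0]  # toy gist embedding of the test input
    candidates = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0], [-1.0, 0.0]]
    print(select_examples(test_vec, candidates, k=2))
```

In practice the candidate embeddings would be computed once and cached, so selecting examples for a new test input costs one encoder pass plus a similarity search.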