MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning

Yichuan Li, Xiyao Ma, Sixing Lu, Kyumin Lee, Xiaohu Liu, Chenlei Guo

TL;DR

MEND tackles the efficiency bottleneck of in-context learning by learning to distill lengthy demonstrations into compact vectors via a meta-trained distillation module. It aligns the distilled prompts with full demonstrations using KL-based knowledge distillation, enabling the LLM to behave as if conditioning on the originals while minimizing computation. The approach employs a two-stage training regime (meta-distillation pretraining on standard text data, followed by meta-distillation fine-tuning on ICL tasks) to acquire transferable distillation knowledge. Across seven MetaICL partitions and multiple model families, MEND matches or surpasses Vanilla ICL and outperforms prior distillation methods, while delivering substantial FLOPs reductions and faster inference. This work paves the way for scalable, efficient deployment of large language models in real-world ICL settings.

Abstract

Large language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities, where an LLM makes predictions for a given test input together with a few input-output pairs (demonstrations). Nevertheless, the inclusion of demonstrations leads to a quadratic increase in the computational overhead of the self-attention mechanism. Existing solutions attempt to distill lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise the LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit knowledge distillation to enhance alignment between MEND and the LLM, achieving both efficiency and effectiveness simultaneously. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process, which includes meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions using decoder-only (GPT-2) and encoder-decoder (T5) models attest to MEND's prowess. It not only matches but often outperforms Vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing the computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of large language models.
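
The knowledge-distillation alignment mentioned in the abstract can be illustrated with a minimal sketch. It assumes the "teacher" is the LLM conditioned on the full demonstrations and the "student" is the same LLM conditioned on the distilled vectors; the toy logits, the function names, and the direction of the KL term are illustrative assumptions, not the paper's exact formulation.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """KL between next-token distributions of teacher and student.

    Assumption: teacher = LLM given full demonstrations,
    student = LLM given distilled vectors in place of the demonstrations.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return kl_divergence(p_teacher, p_student)


# Hypothetical logits over a 3-token vocabulary, for illustration only.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.8]
loss = distillation_loss(teacher_logits, student_logits)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the two distributions coincide and grows as the student's predictions drift from the teacher's, which is why minimizing it pushes the distilled vectors to stand in for the original demonstrations.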

Paper Structure

This paper contains 45 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: The Vanilla ICL method uses the concatenation of demonstrations and test input to generate the output. In contrast, PromptTuning and HyperNetworks employ distilled vectors in place of the full demonstrations. These distilled vectors are significantly shorter than the demonstrations, making in-context learning more compact and efficient for the LLM.
  • Figure 2: Overview of MEND. MEND takes demonstrations and distillation placeholders as input and outputs distillation vectors. To capture the meta-knowledge of demonstration distillation, MEND is trained in two stages: meta-distillation pretraining and fine-tuning.
  • Figure 3: Efficiency analysis of in-context learning at inference time. GPT2-large (774M) and GPT2-XL (1.5B) are evaluated on the same task with batch size 1. The context length for both PromptTuning and MEND is 100, while that of Vanilla ICL varies by partition (Class$\rightarrow$Class is 469, HR$\rightarrow$LR is 652, QA$\rightarrow$QA is 639, non_NLI$\rightarrow$NLI is 848, and non_Para$\rightarrow$Para is 818).
  • Figure 4: Performance under different demonstration distillation ratios. The distillation ratio is the ratio of the number of demonstration examples to the length of the distillation vectors.
  • Figure 5: Attention visualization. The red-outlined left portion of the x-axis denotes either the demonstrations (Vanilla ICL) or the distilled vectors (MEND), and the rest of the x-axis shows the tokens from the test input. The y-axis corresponds to the first token of the output word.
  • ...and 3 more figures
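
The context lengths quoted in the Figure 3 caption give a rough sense of the potential saving. The sketch below assumes self-attention cost grows with the square of the context length (as the abstract states) and ignores the test input, the non-attention layers, and the cost of running MEND itself, so the ratios are a back-of-the-envelope illustration, not the paper's measured FLOPs or latency.

```python
# Context lengths taken from the Figure 3 caption.
VANILLA_CONTEXT = {
    "Class->Class": 469,
    "HR->LR": 652,
    "QA->QA": 639,
    "non_NLI->NLI": 848,
    "non_Para->Para": 818,
}
DISTILLED_CONTEXT = 100  # context length for both PromptTuning and MEND


def attention_cost_ratio(full_len, distilled_len):
    """Ratio of quadratic attention cost: (full_len / distilled_len) ** 2."""
    return (full_len / distilled_len) ** 2


ratios = {
    name: attention_cost_ratio(length, DISTILLED_CONTEXT)
    for name, length in VANILLA_CONTEXT.items()
}

for name, ratio in ratios.items():
    shrink = VANILLA_CONTEXT[name] / DISTILLED_CONTEXT
    print(f"{name}: context {shrink:.2f}x shorter, attention term ~{ratio:.1f}x smaller")
```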