Efficient LLM Context Distillation

Rajesh Upadhayaya, Manish Raj Osti, Zachary Smith, Christopher Kottmyer

TL;DR

The paper addresses adapting LLMs under tight context limits using context distillation (CD), a teacher-student framework in which a frozen teacher guides a LoRA-equipped student through KL-divergence-based updates. Using OPT models of four sizes and datasets for NLI and paraphrase detection (with corresponding out-of-domain (OOD) benchmarks), CD builds an expanded, internalized context from a small number of in-context examples, enabling the student to generalize beyond the prompt window. Results show that CD matches in-context learning (ICL) in-domain and improves out-of-domain generalization. It does not outperform few-shot fine-tuning (FT), but it requires substantially less data and compute, making CD a practical alternative for small datasets. The study presents CD as an efficient method for task-specific adaptation of LLMs and discusses future work, including integrating CD with other efficiency techniques and applying it to code generation tasks.
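The KL-divergence-based update described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the teacher sees the in-context examples plus the query while the student sees only the query, that both produce logits over the same label tokens, and that the loss is KL(teacher || student). The function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    """Loss that pulls the student's label distribution toward the teacher's.

    Assumption: the teacher logits come from a frozen model prompted with
    in-context examples, the student logits from the same query without them.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return kl_divergence(p_teacher, p_student)

# Hypothetical logits over two labels (e.g. "entailment" vs "not entailment").
teacher_logits = [2.0, 0.5]   # teacher, conditioned on in-context examples
student_logits = [0.1, 0.3]   # student, query only
loss = distillation_loss(teacher_logits, student_logits)
```

In a real training loop, this loss would be backpropagated only into the student's LoRA parameters while the teacher stays frozen; the loss reaches zero when the student reproduces the teacher's distribution without needing the context.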

Abstract

Large Language Models (LLMs) demonstrate proficiency across diverse tasks but often require targeted adaptations for specific applications. Various methods have been proposed to facilitate this adaptation, including few-shot fine-tuning, in-context learning, and context distillation. This paper specifically investigates context distillation, a method that extends the utility of task-specific examples by internalizing them, thus augmenting the example set accessible for model inference. We conduct a comparative analysis of context distillation with in-context learning (ICL) and few-shot fine-tuning (FT), aiming to ascertain the efficacy of context distillation in adapting models using minimal in-context examples. Employing matched datasets from Mosbach et al., our experiments leverage OPT models of various sizes. The results indicate that context distillation effectively adapts models, with student models attaining in-domain and out-of-domain accuracies comparable to in-context learning. Although context distillation surpasses ICL in out-of-domain generalization, it does not achieve the performance levels of FT. However, the reduced dataset size and computational demands position context distillation as a viable alternative, especially for smaller datasets. Overall, this study presents context distillation as an efficient and potent method for customizing LLMs to specific tasks.

Paper Structure (14 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Sub-figure (a) shows the ICL results published by Mosbach et al. (2023). Sub-figure (b) shows the CD results from our experiments. Both sub-figures use $n=16$ context examples.
  • Figure 2: Exploring the effect of CD on model quality. Shown is the accuracy of the context-distilled student model.