Efficient LLM Context Distillation
Rajesh Upadhayaya, Manish Raj Osti, Zachary Smith, Christopher Kottmyer
TL;DR
The paper addresses adapting LLMs under tight context limits by introducing context distillation (CD), a teacher-student framework in which a frozen teacher guides a LoRA-equipped student through KL-divergence-based updates. Using OPT models of four sizes and datasets for NLI and paraphrase (with corresponding OOD benchmarks), CD builds an expanded internalized context from a small number of in-context examples, enabling the student to generalize beyond the prompt window. Results show that CD matches in-domain ICL performance and improves out-of-domain generalization, though it does not outperform full fine-tuning. Because it offers substantial reductions in data and compute, CD is a practical alternative for small datasets. The study highlights CD as an efficient, potent method for task-specific adaptation of LLMs and discusses future work, including integrating CD with other efficiency techniques and applying it to code generation tasks.
Abstract
Large Language Models (LLMs) demonstrate proficiency across diverse tasks but often require targeted adaptations for specific applications. Various methods have been proposed to facilitate this adaptation, including few-shot fine-tuning, in-context learning, and context distillation. This paper specifically investigates context distillation, a method that extends the utility of task-specific examples by internalizing them, thus augmenting the example set accessible for model inference. We conduct a comparative analysis of context distillation with in-context learning (ICL) and few-shot fine-tuning (FT), aiming to ascertain the efficacy of context distillation in adapting models using minimal in-context examples. Employing matched datasets from Mobach, our experiments leverage OPT models of various sizes. The results indicate that context distillation effectively adapts models, with student models attaining in-domain and out-of-domain accuracies comparable to those of in-context learning. Although context distillation surpasses ICL in out-of-domain generalization, it does not achieve the performance levels of FT. However, the reduced dataset size and computational demands position context distillation as a viable alternative, especially for smaller datasets. Overall, this study presents context distillation as an efficient and potent method for customizing LLMs to specific tasks.
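To make the teacher-student objective concrete, the following is a minimal sketch of a KL-divergence distillation loss for a single query. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy logits, and the direction of the divergence (teacher relative to student) are all hypothetical choices, and a real setup would compute this over model outputs and backpropagate only into the student's LoRA parameters.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]

def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) between two output distributions.

    Assumption: the frozen teacher scores the query with in-context
    examples in its prompt, while the student scores the bare query.
    Minimizing this loss pushes the student toward the teacher's
    context-informed predictions.
    """
    log_p = log_softmax(teacher_logits)  # teacher log-probabilities
    log_q = log_softmax(student_logits)  # student log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy logits over a 3-way label set (values are made up for illustration).
teacher_logits = [2.0, 0.5, -1.0]  # teacher: query plus in-context examples
student_logits = [0.1, 0.2, 0.0]   # student: query only
loss = distillation_loss(teacher_logits, student_logits)
same = distillation_loss(teacher_logits, teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is what makes it usable as a training signal for internalizing the context.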
