SeCoKD: Aligning Large Language Models for In-Context Learning with Fewer Shots

Weixing Wang, Haojin Yang, Christoph Meinel

TL;DR

SeCoKD is a self-Knowledge Distillation (KD) training framework that aligns the student model with a heavily prompted variation, thereby increasing the utilization of a single demonstration. It introduces few negative artifacts when evaluated on new tasks, which makes it more robust than Supervised Fine-tuning.

Abstract

Previous studies have shown that demonstrations can significantly help Large Language Models (LLMs) perform better on given tasks. However, this so-called In-Context Learning (ICL) ability is very sensitive to the presented context, and often dozens of demonstrations are needed. In this work, we investigate whether we can reduce the number of shots while still maintaining competitive performance. We present SeCoKD, a self-Knowledge Distillation (KD) training framework that aligns the student model with a heavily prompted variation, thereby increasing the utilization of a single demonstration. We evaluate SeCoKD across three LLMs and six benchmarks, focusing mainly on reasoning tasks. Results show that our method outperforms the base model and Supervised Fine-tuning (SFT), especially in zero-shot and one-shot settings, by 30% and 10%, respectively. Moreover, SeCoKD introduces few negative artifacts when evaluated on new tasks, which makes it more robust than Supervised Fine-tuning.
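
The data flow described above can be illustrated with a minimal sketch: the teacher sees a heavily prompted (here 8-shot) version of the query, and its output becomes the training target for a student prompt that holds fewer demonstrations. Everything below is an assumption made for illustration, not the authors' implementation: the `Demo` type, the `Q:`/`A:` prompt template, the `generate` callable standing in for the model, and the default shot counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Demo:
    """One in-context demonstration (hypothetical container)."""
    question: str
    answer: str  # rationale plus final answer


def build_prompt(demos: List[Demo], query: str) -> str:
    """Concatenate demonstrations and the query into one ICL prompt.

    The "Q:/A:" template is an assumption, not the paper's format.
    """
    parts = [f"Q: {d.question}\nA: {d.answer}" for d in demos]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


def make_distillation_pair(
    demos: List[Demo],
    query: str,
    generate: Callable[[str], str],
    teacher_shots: int = 8,
    student_shots: int = 1,
) -> Tuple[str, str]:
    """Return (student_prompt, target) for one training example.

    `generate` is a stand-in for the same model acting as teacher;
    it maps a prompt string to generated text.
    """
    # Teacher: many demonstrations, produces rationale and answer.
    teacher_prompt = build_prompt(demos[:teacher_shots], query)
    target = generate(teacher_prompt)
    # Student: fewer demonstrations, trained to reproduce the target.
    student_prompt = build_prompt(demos[:student_shots], query)
    return student_prompt, target
```

The returned pair would then feed an ordinary fine-tuning loop with a language-modeling loss on the target; that loop is omitted here because the source text does not specify it.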

Paper Structure (23 sections, 4 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview of the SeCoKD framework. The teacher model first generates high-quality rationales and answers for a query through 8-shot ICL. A student is then trained using fewer demonstrations and the teacher's output.
  • Figure 2: Comparison of 4 methods with different shot numbers. The X-axis represents the number of demonstrations used for inference. The Y-axis shows the average accuracy across all six tasks. SeCoKD significantly outperforms the two baselines in zero-shot and one-shot scenarios.
  • Figure 3: Few-shot performance on each task. The X-axis represents the number of demonstrations used for inference. Our methods SeCoKD-S and SeCoKD-M perform much better than the two baselines in zero-shot and one-shot settings.
  • Figure 4: Cross-task tests of one-shot performance on different benchmarks. The Y-axis is the training task, and the X-axis represents the testing task. Each cell value is the absolute accuracy, and red boxes highlight the best score in each column. For example, the top right cell shows the evaluation accuracy on the COIN-FLIP task when the model is trained on the ARC-C task.
  • Figure 5: Cross-task evaluation of one-shot performance across different benchmarks. The Y-axis indicates the training task, while the X-axis represents the testing task. To assess the impact of the training method on model performance, we subtract the baseline accuracy from the accuracy achieved post-training. A red cell color indicates that the trained model outperforms the base model, whereas blue cells signify a decline in performance after training.
  • ...and 3 more figures
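
The subtraction described in the Figure 5 caption can be written out as a short sketch. The function name, the matrix layout (rows are training tasks, columns are testing tasks), and the rounding are assumptions for illustration; the numbers in the usage lines are placeholders, not results from the paper.

```python
from typing import List


def accuracy_delta(
    trained: List[List[float]], baseline: List[float]
) -> List[List[float]]:
    """Subtract the base model's accuracy from post-training accuracy.

    trained[i][j]: accuracy on testing task j after training on task i.
    baseline[j]:   base model accuracy on testing task j.
    Positive cells mean the trained model beats the base model;
    negative cells mean performance dropped after training.
    """
    return [
        [round(acc - baseline[j], 4) for j, acc in enumerate(row)]
        for row in trained
    ]


# Placeholder values only, chosen to show both signs.
example = accuracy_delta([[0.5, 0.4], [0.6, 0.3]], [0.4, 0.4])
print(example)
```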