Grimoire is All You Need for Enhancing Large Language Models
Ding Chen, Shichao Song, Qingchen Yu, Zhiyu Li, Wenjin Wang, Feiyu Xiong, Bo Tang
TL;DR
This work addresses the variability of in-context learning across language models by proposing SLEICL, a framework in which a strong LLM learns from representative demonstrations to generate a grimoire that guides weaker LLMs. It introduces a formal problem setup, multiple representative-sample selection methods (KCS, HCS, HSS, RSS), two grimoire generation templates (Profound Grimoire and Simple Grimoire), and grimoire ranking via similarity or a dual-tower classifier. Empirically, SLEICL yields consistent gains for weak models across eight datasets and six LLMs, with some instances where small models even surpass GPT-4 zero-shot performance, though the best single grimoire sometimes outperforms the ranking-based approach. The findings suggest that grimoire-based guidance is a promising direction for widening the practical reach of ICL, especially for smaller models, and motivate further refinement of sample selection and ranking strategies for broader applicability.
Abstract
In-context learning (ICL) is one of the key methods for enhancing the performance of large language models on specific tasks by providing a set of few-shot examples. However, ICL capability varies significantly across models due to factors such as model architecture, volume of training data, and parameter count. Generally, the larger the model and the more extensive its training data, the stronger its ICL capability. In this paper, we propose SLEICL, a method in which a strong language model learns from examples and then summarizes the learned skills and transfers them to weak language models for inference and application. This approach ensures the stability and effectiveness of ICL. Compared to having weak language models learn directly from prompt examples, SLEICL reduces the difficulty of ICL for these models. Our experiments, conducted on up to eight datasets with five language models, demonstrate that weak language models using SLEICL achieve consistent improvement over their own zero-shot or few-shot capabilities. With the aid of SLEICL, some weak language models even surpass the zero-shot performance of GPT4-1106-preview.
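The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `LLM` callable interface, the prompt wording, the function names, and the stub models are all assumptions made for illustration, and the paper's actual grimoire templates and sample selection methods are not reproduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption): a model is any function mapping a prompt to text.
LLM = Callable[[str], str]


def generate_grimoire(strong_llm: LLM, task: str, demos: List[Tuple[str, str]]) -> str:
    """Ask the strong model to distil labelled demonstrations into reusable guidance."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    prompt = (
        f"Task: {task}\n"
        f"Examples:\n{shots}\n"
        "Summarize the skills needed to solve this task as concise guidance."
    )
    return strong_llm(prompt)


def answer_with_grimoire(weak_llm: LLM, grimoire: str, query: str) -> str:
    """Give the weak model the grimoire in place of raw few-shot examples."""
    prompt = f"Guidance:\n{grimoire}\n\nInput: {query}\nLabel:"
    return weak_llm(prompt)


# Stub models so the sketch runs without any API; real models would replace these.
def stub_strong_llm(prompt: str) -> str:
    return "Positive reviews praise; negative reviews complain."


def stub_weak_llm(prompt: str) -> str:
    # Toy stand-in: predicts "positive" only when the prompt mentions "great".
    return "positive" if "great" in prompt else "negative"


demos = [("I loved it", "positive"), ("Terrible pacing", "negative")]
grimoire = generate_grimoire(stub_strong_llm, "sentiment classification", demos)
prediction = answer_with_grimoire(stub_weak_llm, grimoire, "The plot was great.")
```

The design point this sketch is meant to convey is that the weak model never sees the raw demonstrations at inference time; it only sees the guidance text produced once by the strong model, which can then be reused across queries.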
