Supervised Knowledge Makes Large Language Models Better In-context Learners
Linyi Yang, Shuibai Zhang, Zhuohao Yu, Guangsheng Bao, Yidong Wang, Jindong Wang, Ruochen Xu, Wei Ye, Xing Xie, Weizhu Chen, Yue Zhang
TL;DR
This paper addresses the limited out-of-distribution (OOD) generalization and factuality of large language models (LLMs) in in-context learning. It introduces SuperContext, a plug-in framework that injects supervised knowledge from task-specific fine-tuned discriminative language models (SLMs) into LLM prompts. For each example, the SLM's prediction and confidence, denoted r_i, are inserted between the input and the output, optionally followed by an interpretation prompt, so that the LLM can draw on task-specific knowledge during inference. On the GLUE-X and SQuAD 2.0 benchmarks, SuperContext improves generalization and reduces hallucinations, surpassing both pure LLMs and SLMs in several zero-shot and few-shot settings and approaching or exceeding some fine-tuned baselines. The work releases extensive resources (datasets, prompts, checkpoints, outputs) and shows the practical value of integrating discriminative models into LLM inference for more reliable, cost-efficient NLP systems.
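The prompt construction summarized above can be illustrated with a minimal sketch. It is not the authors' released prompt format: the `Example` class, the `format_example` and `build_prompt` helpers, the field names, and the template wording are all hypothetical, chosen only to show where the SLM's prediction and confidence (r_i) might sit between an input and its output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One in-context example (hypothetical structure, not from the paper)."""
    text: str                    # task input x_i
    slm_label: str               # label predicted by the task-specific SLM
    slm_conf: float              # SLM confidence in [0, 1]
    gold: Optional[str] = None   # output y_i; None for the test query


def format_example(ex: Example, ask_interpretation: bool = False) -> str:
    """Render one example with the SLM receipt r_i between input and output."""
    lines = [
        f"Input: {ex.text}",
        # r_i: the SLM's prediction and confidence (assumed wording)
        f"Model prediction: {ex.slm_label} (confidence: {ex.slm_conf:.2f})",
    ]
    if ex.gold is not None:
        # Demonstration: show the reference output.
        lines.append(f"Answer: {ex.gold}")
    else:
        # Query: optionally ask the LLM to interpret the SLM's prediction.
        if ask_interpretation:
            lines.append(
                "Explain briefly whether you trust the model prediction, "
                "then answer."
            )
        lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(
    instruction: str,
    demos: List[Example],
    query: Example,
    ask_interpretation: bool = False,
) -> str:
    """Join the instruction, the demonstrations, and the query into one prompt."""
    blocks = [instruction]
    blocks += [format_example(d) for d in demos]
    blocks.append(format_example(query, ask_interpretation))
    return "\n\n".join(blocks)
```

Passing an empty `demos` list would correspond to a zero-shot prompt in this sketch, while a non-empty list gives a few-shot prompt. In both cases the query carries the SLM's prediction and confidence.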
Abstract
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-Specific fine-tuned Language Models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions in generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. The code and data are released at: https://github.com/YangLinyi/Supervised-Knowledge-Makes-Large-Language-Models-Better-In-context-Learners. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.
