Supervised Knowledge Makes Large Language Models Better In-context Learners

Linyi Yang, Shuibai Zhang, Zhuohao Yu, Guangsheng Bao, Yidong Wang, Jindong Wang, Ruochen Xu, Wei Ye, Xing Xie, Weizhu Chen, Yue Zhang

TL;DR

This paper addresses the limited out-of-distribution (OOD) generalization and factuality of large language models (LLMs) in in-context learning by introducing SuperContext, a plug-in framework that injects supervised knowledge from task-specific fine-tuned discriminative models (SLMs) into LLM prompts. For each example, the SLM's prediction and confidence, denoted r_i, are inserted between the input and the output of the input-output pair, optionally followed by an interpretation prompt, so that the LLM can draw on task-specific knowledge during inference. On the GLUE-X and SQuAD 2.0 benchmarks, SuperContext improves generalization and reduces hallucinations, surpassing both pure LLMs and SLMs in several zero-shot and few-shot settings and even approaching or exceeding some fine-tuned baselines. The work releases datasets, prompts, model checkpoints, and LLM outputs, and demonstrates the practical value of integrating discriminative models into LLM inference to build more reliable, cost-efficient NLP systems.
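
The prompt layout described above can be illustrated with a minimal Python sketch. It is not the authors' template: the `Demo` class, the `format_receipt` and `build_prompt` helpers, and all prompt wording (including the interpretation instruction) are hypothetical and chosen only to show where r_i sits relative to the input and the answer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One in-context demonstration (x_i, r_i, y_i). Hypothetical container."""
    text: str              # x_i: the input
    slm_label: str         # SLM prediction, first part of r_i
    slm_confidence: float  # SLM confidence in [0, 1], second part of r_i
    gold_label: str        # y_i: the reference answer


def format_receipt(label: str, confidence: float) -> str:
    # Assumed wording for r_i; the paper's actual template may differ.
    return f"Model prediction: {label} (confidence: {confidence:.2f})"


def build_prompt(
    demos: List[Demo],
    query_text: str,
    query_slm_label: str,
    query_slm_confidence: float,
    interpret: bool = False,
) -> str:
    """Place r_i between each input and its answer, then append the query."""
    parts = []
    for d in demos:
        parts.append(
            f"Input: {d.text}\n"
            f"{format_receipt(d.slm_label, d.slm_confidence)}\n"
            f"Answer: {d.gold_label}"
        )
    query = (
        f"Input: {query_text}\n"
        f"{format_receipt(query_slm_label, query_slm_confidence)}\n"
    )
    if interpret:
        # Optional interpretation prompt; the wording is an assumption.
        query += "Explain briefly whether you agree with the model prediction.\n"
    query += "Answer:"
    parts.append(query)
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = Demo("A moving, well-acted film.", "positive", 0.97, "positive")
    print(build_prompt([demo], "The plot drags badly.", "negative", 0.61))
```

Passing an empty `demos` list gives a zero-shot variant in which only the query carries the SLM output; setting `interpret=True` adds the optional interpretation instruction before the answer slot.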

Abstract

Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-Specific fine-tuned Language Models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions in generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. The code and data are released at: https://github.com/YangLinyi/Supervised-Knowledge-Makes-Large-Language-Models-Better-In-context-Learners. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: We denote ($x_i, y_i$) as a question-answer pair; our receipt $r_i$ is inserted between the question and the answer. Supervised knowledge plays a key role in improving the OOD generalizability and factuality of LLMs. The two analysis tasks that follow aim to explain why our method outperforms traditional in-context learning.
  • Figure 2: Illustration of prompt designs, where the supervised knowledge provided by the discriminative model is defined as $r_i$, and the optional interpretation prompt is denoted as $s_i$.
  • Figure 3: Number of times each of the 16-shot in-context examples was identified as an influential example across 8 NLU tasks, sorted by order of occurrence.
  • Figure 4: Correlation between SLM confidence and LLM performance on the GLUE-X benchmark. The dark green line shows the normalized performance of LLMs using SuperContext (right y-axis), while the light green bars show the number of instances in each confidence interval (left y-axis).
  • Figure 5: The calibration laws of ELECTRA-large and InstructGPT, relating confidence to performance on the GLUE-X benchmark. The dark green line shows the LLMs' performance using SuperContext (right y-axis), while the light green bars show the number of instances in each confidence interval (left y-axis).
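
The quantities plotted in Figures 4 and 5 (instance counts and performance per confidence interval) can be sketched in a few lines of Python. This is only an illustration of confidence binning under assumed inputs, not the authors' evaluation code: the function name `accuracy_by_confidence`, the equal-width bins, and the use of plain accuracy (rather than the normalized performance in Figure 4) are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple

# (bin lower edge, bin upper edge, instance count, accuracy or None if empty)
BinRow = Tuple[float, float, int, Optional[float]]


def accuracy_by_confidence(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,
) -> List[BinRow]:
    """Group instances into equal-width confidence bins and score each bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    counts = [0] * n_bins
    hits = [0] * n_bins
    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        counts[idx] += 1
        hits[idx] += int(ok)
    rows: List[BinRow] = []
    for i in range(n_bins):
        acc = hits[i] / counts[i] if counts[i] else None
        rows.append((i / n_bins, (i + 1) / n_bins, counts[i], acc))
    return rows


if __name__ == "__main__":
    # Toy values, invented purely for illustration.
    conf = [0.55, 0.95, 0.99, 0.65]
    ok = [False, True, True, True]
    for lo, hi, n, acc in accuracy_by_confidence(conf, ok):
        print(f"[{lo:.1f}, {hi:.1f}): n={n}, accuracy={acc}")
```

In this sketch the counts would correspond to the bars and the per-bin accuracy to the line; empty bins report `None` rather than a misleading zero.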