Misconfidence-based Demonstration Selection for LLM In-Context Learning

Shangqing Xu; Chao Zhang

TL;DR

The paper addresses the sensitivity of in-context learning (ICL) to demonstration quality. The authors propose In-Context Reflection (ICR), which uses a misconfidence metric to find candidate examples that expose gaps between the LLM's output distribution and the task's input-output mappings, and swaps them into the prompt in place of less informative demonstrations, without requiring extra supervision. Across 13 subtasks from five task sets, ICR yields about a 4% average improvement and generalizes across tasks, and ablations indicate robustness to several design choices. The approach offers a scalable, label-distribution-preserving strategy for building few-shot prompts that makes ICL more robust across diverse domains.

Abstract

In-context learning with large language models (LLMs) excels at adapting to various tasks rapidly. However, its success hinges on carefully selecting demonstrations, which remains an obstacle in practice. Current approaches to this problem either rely on hard-to-acquire external supervision or require frequent interactions with LLMs, resulting in high costs. We propose a new method called In-Context Reflection (ICR) to overcome these challenges. ICR strategically selects demonstrations to reduce the discrepancy between the LLM's outputs and the actual input-output mappings. Specifically, ICR starts with a random set of initial demonstrations, then iteratively refines it. In each step, it analyzes a pool of candidate examples and identifies the ones most likely to challenge the LLM's current understanding, measured by a new metric called misconfidence. These most confusing examples are then selected to replace the less informative demonstrations in the current set. Our comprehensive evaluation across five diverse datasets encompassing 13 subtasks shows the efficacy of ICR. Compared to existing methods, ICR achieves an average performance boost of 4%, while demonstrating remarkable cross-task generalization capabilities.
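The refinement loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the abstract does not define the misconfidence score, so the code assumes it is the highest probability assigned to a wrong label divided by the probability assigned to the gold label. The `Predictor` interface, the `icr_step` helper, the toy model, and the choice to replace the earliest demonstrations are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)
# Hypothetical model interface: (current demonstrations, input text) -> label probabilities.
Predictor = Callable[[List[Example], str], Dict[str, float]]


def misconfidence(probs: Dict[str, float], gold: str, eps: float = 1e-12) -> float:
    """Assumed score: highest wrong-label probability over gold-label probability."""
    wrong = max((p for label, p in probs.items() if label != gold), default=0.0)
    return wrong / (probs.get(gold, 0.0) + eps)


def icr_step(prompt: List[Example], pool: List[Example],
             predict: Predictor, num_replace: int) -> List[Example]:
    """One refinement step: score every candidate under the current prompt,
    rerank by misconfidence, and swap the top candidates into the prompt."""
    ranked = sorted(
        pool,
        key=lambda ex: misconfidence(predict(prompt, ex[0]), ex[1]),
        reverse=True,
    )
    top = [ex for ex in ranked if ex not in prompt][:num_replace]
    # Assumption: the earliest demonstrations are the ones replaced; the
    # abstract only says "less informative" ones are swapped out.
    return prompt[len(top):] + top


def toy_predict(prompt: List[Example], text: str) -> Dict[str, float]:
    """Stand-in for an LLM: predicts "neg" when the text contains "not", else "pos"."""
    if "not" in text:
        return {"pos": 0.2, "neg": 0.8}
    return {"pos": 0.9, "neg": 0.1}


pool = [("good", "pos"), ("not bad", "pos"), ("not good", "neg"), ("awful", "neg")]
prompt = [("good", "pos"), ("not good", "neg")]  # fixed here; the paper starts from a random set
refined = icr_step(prompt, pool, toy_predict, num_replace=1)
print(refined)
```

In this toy run the candidate the stand-in model gets most confidently wrong, `("awful", "neg")`, is the one pulled into the prompt. Repeating `icr_step` on its own output would correspond to multiple ICR iterations. The sketch does not attempt to preserve the label distribution of the prompt, which the TL;DR mentions as a property of the actual method.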

Paper Structure (23 sections, 2 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Problem Formulation
Method
In-Context Learning by Bridging Discrepancy
In-Context Reflection (ICR)
Experiment
Settings
Datasets
Baselines
Evaluation and Implementation Details
Same-Task Evaluation
Different-Task Evaluation
Ablation
Multiple ICR Iterations
...and 8 more sections

Figures (7)

Figure 1: An overview of our method. We aim to leverage the exact discrepancy between the LLM's knowledge and the task's input-output mappings, then select the demonstrations that best bridge this discrepancy.
Figure 2: An overview of In-Context Reflection (ICR). We first use an initial prompt to obtain the LLM's prediction for each candidate, then calculate the misconfidence score $\psi$ to measure the discrepancy between the LLM and the task. We then rerank all candidates by $\psi$ and replace part of the prompt with the top-ranked candidates, obtaining a refined prompt.
Figure 3: Different-task evaluation accuracy of ICR's prompt on GLUE tasks. Each number shows the performance gain compared with same-task uniform sampling. On all tasks, ICR achieves comparable (sometimes even superior) results.
Figure 4: Performance gain of ICR over a random prompt across 5 iterations. ICR yields a gain in most cases, but the effect is unstable.
Figure 5: Visualization of results from demonstrations with different misconfidence levels on (a) GLUE-MRPC and (b) Poem Sentiment. Demonstrations with larger misconfidence lead to better in-context performance.
...and 2 more figures
