DemoShapley: Valuation of Demonstrations for In-Context Learning

Shan Xie, Man Luo, Chadly Daniel Stern, Mengnan Du, Lu Cheng

TL;DR

This work tackles the instability of in-context learning (ICL) caused by demonstration selection and order. It introduces DemoShapley and Beta-DemoShapley, two Shapley-value-based methods that quantify each demonstration's marginal contribution by averaging effects over multiple prompt permutations, with Beta weighting to emphasize small prompts. Across multiple LLMs and tasks, these methods improve predictive performance, enhance out-of-distribution generalization, detect mislabeled data, and reduce bias, with Beta-DemoShapley particularly benefiting low-shot settings. Importantly, the approach operates at inference time without gradient access or fine-tuning, providing a principled, fair framework for robust demonstration valuation in practical ICL deployment.

Abstract

Large language models (LLMs) using in-context learning (ICL) excel at many tasks without task-specific fine-tuning. However, demonstration selection and ordering greatly impact ICL effectiveness. To address this issue, we propose DemoShapley, a Shapley-value-based method that evaluates each demonstration's contribution by measuring its marginal effect across different prompt permutations. To further account for ICL's limited context windows and frequent low-shot settings, we introduce Beta-DemoShapley, a weighted extension that emphasizes the influence of smaller prompt sizes. Experiments on multiple benchmarks show that DemoShapley consistently outperforms existing influence-based selection strategies, while Beta-DemoShapley further improves performance in low-shot scenarios. Both methods also detect mislabeled data, enhance generalization to out-of-distribution tasks, and reduce demographic bias. Together, they provide a unified and robust framework for demonstration valuation in ICL.
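
For orientation, the "marginal effect across different prompt permutations" can be read against the textbook Shapley value. The notation below is generic and is an assumption of this summary, not necessarily the paper's own: $\mathcal{Z}$ is a candidate set of $n$ demonstrations, $U(\cdot)$ is a utility such as development-set accuracy of the model when prompted with the given demonstrations, and $P_i^{\pi}$ is the set of demonstrations that precede $z_i$ in permutation $\pi$.

$$
\phi_i \;=\; \frac{1}{n}\sum_{S \subseteq \mathcal{Z}\setminus\{z_i\}} \binom{n-1}{|S|}^{-1}\Big[\,U\big(S \cup \{z_i\}\big) - U(S)\,\Big]
\;=\; \frac{1}{n!}\sum_{\pi \in \Pi(\mathcal{Z})}\Big[\,U\big(P_i^{\pi} \cup \{z_i\}\big) - U\big(P_i^{\pi}\big)\,\Big]
$$

The right-hand form is the one that Monte Carlo estimators approximate by sampling a limited number of permutations instead of enumerating all $n!$. A weighted variant such as Beta-DemoShapley would, on this reading, replace the uniform average over prefix sizes with weights that favor small $|P_i^{\pi}|$; the exact weights are defined in the paper's equations, which are not reproduced here.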

Paper Structure

This paper contains 20 sections, 9 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: We begin by selecting a candidate demonstration set and a development dataset, then define hyperparameters $K, D, C, N$ and threshold $\mu$. Starting from zero-shot learning, examples are added sequentially, with their DemoShapley values updated per prompt. Beta-DemoShapley differs by applying a pre-computed weight to the marginal contribution during updates. The algorithm iterates multiple times to ensure all candidate examples are thoroughly evaluated before concluding.
  • Figure 2: Effect of demonstration values on predictive performance across Toxi-text-3M and Adult datasets using ChatGPT-3.5-Turbo and GPT-J-6B. Columns show adding/removing high- or low-value demonstrations. DemoShapley and Beta-DemoShapley align well with actual performance: high-value samples improve accuracy, while low-value samples tend to harm it.
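
The procedure described in the Figure 1 caption (start from zero-shot, add examples one at a time, update each example's value by its marginal contribution, repeat over several iterations) can be sketched as a Monte Carlo permutation estimator. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `estimate_demo_values`, `utility`, `n_permutations`, and `size_weights` are hypothetical, the hyperparameters $K, D, C, N$ and the threshold $\mu$ are omitted because their definitions are not given here, and the toy utility stands in for a real LLM evaluation on a development set.

```python
import random
from typing import Callable, List, Optional, Sequence


def estimate_demo_values(
    n_demos: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    size_weights: Optional[Sequence[float]] = None,
    seed: int = 0,
) -> List[float]:
    """Estimate a value for each demonstration by sampling permutations.

    utility(prompt) scores an ordered list of demonstration indices
    (hypothetical interface; in practice this would query an LLM).
    size_weights[k], if given, scales the marginal contribution of the
    demonstration added at position k (an assumed stand-in for the
    pre-computed weights mentioned in the Figure 1 caption).
    """
    rng = random.Random(seed)
    totals = [0.0] * n_demos
    for _ in range(n_permutations):
        order = list(range(n_demos))
        rng.shuffle(order)
        prompt: List[int] = []
        prev_score = utility(prompt)  # zero-shot baseline
        for position, demo in enumerate(order):
            prompt.append(demo)
            score = utility(prompt)
            weight = 1.0 if size_weights is None else size_weights[position]
            totals[demo] += weight * (score - prev_score)
            prev_score = score
    return [total / n_permutations for total in totals]


# Toy additive utility: each demonstration contributes a fixed amount.
# Demonstration 2 plays the role of a harmful (e.g. mislabeled) example.
toy_scores = [0.3, 0.1, -0.2]


def toy_utility(prompt: Sequence[int]) -> float:
    return sum(toy_scores[i] for i in prompt)


uniform_values = estimate_demo_values(3, toy_utility, n_permutations=50)
small_prompt_values = estimate_demo_values(
    3, toy_utility, n_permutations=50, size_weights=[1.0, 0.5, 0.25]
)
```

With the additive toy utility, the unweighted estimate recovers each demonstration's fixed contribution up to floating-point rounding, so the harmful example receives a negative value. With decaying `size_weights`, contributions made early in a prompt count more, which mirrors the stated emphasis on small prompt sizes without claiming to match the paper's actual weighting.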