
Prompt Valuation Based on Shapley Values

Hanxi Liu, Xiaokai Mao, Haocheng Xia, Jian Lou, Jinfei Liu, Kui Ren

TL;DR

The paper tackles fair, interaction-aware evaluation of prompts in multi-prompt learning by applying Shapley values to quantify each prompt's contribution to task performance. It introduces a two-stage, learning-based approach that estimates Shapley values from prompt embeddings, enabling real-time valuation, and proves a Lipschitz bound that links prompt similarity to Shapley-value similarity. The method is validated on SST2, AQuA, and Date with BERT and GPT-3.5-turbo, showing that a compact set of high-value prompts can achieve competitive performance and that Shapley-based ranking reliably identifies valuable prompts. The work has practical implications for prompt design and data marketplaces, offering a principled, scalable mechanism to price and select prompts for ensembles.

Abstract

Large language models (LLMs) excel on new tasks without additional training when they are simply given natural language prompts that demonstrate how the task should be performed. Prompt ensemble methods comprehensively harness the knowledge of LLMs while mitigating individual biases and errors, further enhancing performance. However, more prompts do not necessarily lead to better results, and not all prompts are beneficial. A small number of high-quality prompts often outperforms many low-quality prompts. Currently, no suitable method exists for evaluating the impact of prompts on the results. In this paper, we utilize the Shapley value to fairly quantify the contributions of prompts, helping to identify beneficial or detrimental prompts and potentially guiding prompt valuation in data markets. Through extensive experiments employing various ensemble methods and utility functions on diverse tasks, we validate the Shapley value method for prompts, showing that it effectively distinguishes and quantifies the contribution of each prompt.
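
To make the idea of a prompt's Shapley value concrete, here is a minimal sketch that computes exact Shapley values for a three-prompt ensemble by enumerating every coalition. The prompt names, labels, predictions, and the majority-vote utility (`majority_vote_accuracy`) are hypothetical illustrations, not the paper's data or its ensemble methods, and exact enumeration is only feasible for a handful of prompts.

```python
from itertools import combinations
from math import factorial

# Hypothetical data: predictions of three prompts on four validation examples.
LABELS = [1, 0, 1, 1]
PREDICTIONS = {
    "p1": [1, 0, 1, 1],  # always right
    "p2": [1, 0, 0, 1],  # right on 3 of 4
    "p3": [0, 1, 0, 0],  # always wrong
}

def majority_vote_accuracy(coalition):
    """Toy utility: accuracy of a majority vote; a tie counts as an error."""
    if not coalition:
        return 0.0
    correct = 0
    for i, label in enumerate(LABELS):
        ones = sum(PREDICTIONS[p][i] for p in coalition)
        zeros = len(coalition) - ones
        if ones != zeros and (1 if ones > zeros else 0) == label:
            correct += 1
    return correct / len(LABELS)

def shapley_values(players, utility):
    """Exact Shapley values: weighted average of marginal contributions."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that excludes p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

values = shapley_values(list(PREDICTIONS), majority_vote_accuracy)
for name, value in sorted(values.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:+.3f}")
```

In this toy setup the always-wrong prompt `p3` receives a negative value, which is the sense in which a Shapley value can flag a detrimental prompt, and the three values sum to the utility of the full ensemble.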

Paper Structure

This paper contains 32 sections, 2 theorems, 28 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{U}$ be a utility function. If $\mathcal{U}$ is Lipschitz continuous with respect to some norm $\|\cdot\|$ on the input space with Lipschitz constant $L$, then for any two inputs $\bm{e}_1$ and $\bm{e}_2$ corresponding to similar prompts, the absolute difference in their Shapley values [...] where $S$ is a coalition of embeddings excluding $\bm{e}_i$ and $\bm{e}_j$.
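
The role of the coalition $S$ can be seen from a standard identity for Shapley values. The sketch below is a textbook-style derivation, not the paper's proof: it assumes $n$ players in a set $N$, writes $\phi_i$ for the Shapley value of embedding $\bm{e}_i$, and assumes the Lipschitz condition is applied to swapping $\bm{e}_i$ for $\bm{e}_j$ inside every coalition. The bound stated in the paper's Theorem 1 may differ in form or constant.

```latex
% Shapley value of player i (standard definition)
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}
         \bigl[\mathcal{U}(S \cup \{i\}) - \mathcal{U}(S)\bigr]

% Difference of two Shapley values, summed over coalitions containing neither i nor j
\phi_i - \phi_j = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|!\,(n-|S|-2)!}{(n-1)!}
         \bigl[\mathcal{U}(S \cup \{i\}) - \mathcal{U}(S \cup \{j\})\bigr]

% Assumption: |U(S \cup \{i\}) - U(S \cup \{j\})| \le L \|e_i - e_j\| for every such S.
% The weights above sum to 1, so the triangle inequality gives
|\phi_i - \phi_j| \le L\,\|\bm{e}_i - \bm{e}_j\|
```

Under these assumptions, prompts with nearby embeddings receive nearby Shapley values, which is what makes predicting values from embeddings plausible.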

Figures (7)

  • Figure 1: Examples of prompt ensembling and prompt augmentation.
  • Figure 2: Examples of coalitions.
  • Figure 3: Results of SST2 with BERT-base, as well as AQuA and Date with GPT-3.5-turbo and Manual-CoT. We add the currently most valuable prompt to the combination iteratively. For comparison, we also calculate the leave-one-out (LOO) value and combine prompts in the same manner.
  • Figure 4: Prompts are sorted by the Shapley values obtained from each of the three methods and then added in that order; accuracy is calculated separately for each method on SST2.
  • Figure 5: Results for AQuA using Manual prompt (Manual-CoT but without rationale) and Auto-CoT.
  • ...and 2 more figures
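
The evaluation described in the Figure 3 caption, adding prompts in order of value and comparing against leave-one-out (LOO), can be sketched as follows. This is a minimal illustration under stated assumptions: the `UTILITY` table and prompt names are hypothetical, the ranking is computed once rather than re-evaluated after each addition, and none of the numbers come from the paper.

```python
# Hypothetical utility table: accuracy of each prompt coalition (not from the paper).
UTILITY = {
    frozenset(): 0.0,
    frozenset({"p1"}): 1.0,
    frozenset({"p2"}): 0.75,
    frozenset({"p3"}): 0.0,
    frozenset({"p1", "p2"}): 0.75,
    frozenset({"p1", "p3"}): 0.0,
    frozenset({"p2", "p3"}): 0.0,
    frozenset({"p1", "p2", "p3"}): 0.75,
}
PROMPTS = ["p1", "p2", "p3"]

def utility(coalition):
    """Look up the toy utility of a coalition given as any iterable of names."""
    return UTILITY[frozenset(coalition)]

def loo_values(players, utility):
    """Leave-one-out value: utility drop when one prompt is removed from the full set."""
    full = utility(players)
    return {p: full - utility([q for q in players if q != p]) for p in players}

def rank_by(values):
    """Prompt names sorted from highest to lowest value (ties keep input order)."""
    return sorted(values, key=lambda p: -values[p])

def addition_curve(ranking, utility):
    """Utility after adding the top-1, top-2, ... prompts of a ranking."""
    return [utility(ranking[:k]) for k in range(1, len(ranking) + 1)]

loo = loo_values(PROMPTS, utility)
curve = addition_curve(rank_by(loo), utility)
print(loo)
print(curve)
```

In this toy table LOO assigns `p1` and `p2` the same value even though `p1` alone is more accurate, because LOO only measures each prompt's effect on the full set. A ranking derived from Shapley values can be passed to `addition_curve` in the same way to compare the two curves.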

Theorems & Definitions (6)

  • Definition 1: Lipschitz Continuity
  • Definition 2: Beta Distribution
  • Theorem 1
  • Lemma 1
  • proof
  • proof