Evaluating LLM-persona Generated Distributions for Decision-making

Jackie Baek; Yunhan Chen; Ziyu Chi; Will Ma

Abstract

LLMs can generate a wealth of data, ranging from simulated personas imitating human valuations and preferences, to demand forecasts based on world knowledge. But how well do such LLM-generated distributions support downstream decision-making? For example, when pricing a new product, a firm could prompt an LLM to simulate how much consumers are willing to pay based on a product description, but how useful is the resulting distribution for optimizing the price? We refer to this approach as LLM-SAA, in which an LLM is used to construct an estimated distribution and the decision is then optimized under that distribution. In this paper, we study metrics to evaluate the quality of these LLM-generated distributions, based on the decisions they induce. Taking three canonical decision-making problems (assortment optimization, pricing, and newsvendor) as examples, we find that LLM-generated distributions are practically useful, especially in low-data regimes. We also show that decision-agnostic metrics such as Wasserstein distance can be misleading when evaluating these distributions for decision-making.
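To make the LLM-SAA idea concrete for the pricing example, the following is a minimal sketch, not the paper's implementation. It assumes a single posted price, revenue equal to the price times the fraction of customers whose willingness-to-pay is at least that price, and a ratio of achieved to optimal revenue as a stand-in for a decision-aware metric; the paper's exact competitive-ratio definitions are not reproduced here. The function names and the sample values are hypothetical.

```python
def revenue(price, wtp):
    """Expected revenue per customer at `price` under the empirical sample `wtp`."""
    return price * sum(v >= price for v in wtp) / len(wtp)


def saa_price(wtp):
    """Price maximizing empirical revenue.

    For a posted price, an optimum always lies at one of the sampled values,
    so it suffices to search over the distinct sample points.
    """
    return max(sorted(set(wtp)), key=lambda p: revenue(p, wtp))


def decision_ratio(est_wtp, true_wtp):
    """Revenue earned under the true sample by the price optimized on the
    estimated sample, divided by the best achievable revenue (assumed metric)."""
    p_hat = saa_price(est_wtp)    # decision induced by the estimated distribution
    p_star = saa_price(true_wtp)  # decision with full knowledge of the truth
    return revenue(p_hat, true_wtp) / revenue(p_star, true_wtp)


# Hypothetical willingness-to-pay samples, for illustration only.
true_wtp = [44, 49, 49, 54, 59, 64]
llm_wtp = [49, 49, 54, 54, 59]

ratio = decision_ratio(llm_wtp, true_wtp)
```

A ratio near 1 means the estimated distribution led to a near-optimal price, even if the two samples differ noticeably as distributions, which is the distinction between decision-aware and decision-agnostic evaluation that the abstract draws.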

Paper Structure (41 sections, 5 theorems, 15 equations, 5 figures, 4 tables, 3 algorithms)

Introduction
Contributions and approach.
Related Literature
Using LLMs to Simulate Humans
Data-driven and Robust Optimization
LLMs for Operational Decisions
Problems, Methods, and Metrics
Generation Methods and Baselines
Metrics
Decision-aware metrics, via competitive ratios.
Decision-agnostic metrics.
Assortment problem: Details and Results
Dataset.
Details of LLM generation methods and baselines.
Further problem details.
...and 26 more sections

Key Result

Lemma 1

Let $a^*,{\hat{a}}\subseteq[n]$ be assortments for which \ref{eqn:theoryPrereq} holds. If $\min\{|a^*\setminus{\hat{a}}|,|{\hat{a}}\setminus a^*|\}\le 1$, then there exist sizes $(s_j)_{j\in[n]}$ and a budget $B$ such that only the sets $a^*,{\hat{a}}$ can be maximal in ${\mathcal{A}}_\theta$, and hence e

Figures (5)

Figure 1: Assortment results, displaying means across 20 generations, with 95% confidence intervals around the mean. Higher is better for the CR metrics while lower is better for the Wasserstein metric. The LLM is GPT-4o.
Figure 2: Pricing results, displaying means across 20 generations and 6 ground truths, with 95% confidence intervals around the mean. Higher is better for the CR metrics while lower is better for the Kolmogorov and Wasserstein metrics. The LLM is GPT-4o.
Figure 3: Newsvendor results across four LLM models, displaying means across 300 items with 95% confidence intervals. Higher is better for the CR metrics while lower is better for the Kolmogorov and Wasserstein metrics. The $\mathsf{AvgCR}$ uses the distribution $q \sim \mathrm{Unif}[0.01, 0.99]$.
Figure 4: Illustration of the enumeration and pruning procedure, in our special case where $n=10$ and hence there are 1023 non-empty assortments. Rows correspond to $R_\theta(\hat{a})$ and columns correspond to $R_\theta(a^\ast)$, while each cell represents a ratio $R_\theta(\hat{a})/R_\theta(a^\ast)$. Assortments are ordered by $R_\theta(\cdot)$, with $\hat{a}_1$ and $a^\ast_1$ denoting the smallest rewards in their respective orders. We traverse only the red and green cells; light-blue cells violate \ref{eqn:optimalityConds} and are pruned. The boundary is illustrative and not necessarily diagonal.
Figure 5: Survival functions of the willingness-to-pay distributions for the ground truth distribution $F$ and the estimated distributions $\hat{F}$ from the Sampling and Persona sampling methods, for the product Bohol. The "drops" in the functions occur at numbers that end in 4 or 9. This is because, in the data, respondents were asked about their willingness-to-pay premiums over the base price of PhP 44 (for an "upgraded" product), and this figure plots their total willingness-to-pay (44 + their response).

Theorems & Definitions (5)

Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
