Designing Service Systems from Textual Evidence

Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, David Simchi-Levi

TL;DR

The algorithm, PP-LUCB, jointly decides which alternatives to evaluate and whether to request human audits, concentrating reviews where the LLM judge is least reliable. It relies on an estimator that combines proxy scores with inverse-propensity-weighted residuals, together with anytime-valid confidence sequences.

Abstract

Designing service systems requires selecting among alternative configurations -- choosing the best chatbot variant, the optimal routing policy, or the most effective quality control procedure. In many service systems, the primary evidence of performance quality is textual -- customer support transcripts, complaint narratives, compliance review reports -- rather than the scalar measurements assumed by classical optimization methods. Large language models (LLMs) can read such textual evidence and produce standardized quality scores, but these automated judges exhibit systematic biases that vary across alternatives and evaluation instances. Human expert review remains accurate but costly. We study how to identify the best service configuration with high confidence while minimizing expensive human audits, given that automated evaluation is cheap but biased. We formalize this as a sequential decision problem where a biased proxy score is observed for every evaluation, and a verified outcome can be acquired selectively at additional cost. We prove that LLM-only selection fails under arm-dependent bias, and that naive selective-audit estimators can be asymptotically biased. We develop an estimator combining proxy scores with inverse-propensity-weighted residuals and construct anytime-valid confidence sequences. Our algorithm, PP-LUCB, jointly decides which alternatives to evaluate and whether to request human audits, concentrating reviews where the LLM judge is least reliable. We prove correctness and establish instance-dependent cost bounds showing near-optimal efficiency. On a customer support ticket classification task, our algorithm correctly identifies the best model in 40/40 trials while achieving 90% audit cost reduction.
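The estimator described in the abstract (proxy scores plus inverse-propensity-weighted residuals) can be sketched for a single arm as the average of F + (A / p) * (Y - F), where F is the proxy score, Y the verified outcome, A the audit indicator, and p the audit probability. The following is a minimal sketch under stated assumptions, not the authors' implementation: the simulation setup, the function names, and all numeric values (true mean 0.6, proxy bias +0.1, audit rates 0.1 and 0.4) are hypothetical and chosen only for illustration. It also shows why averaging only the audited outcomes can be biased when the audit rate depends on the proxy score.

```python
import random


def simulate_audits(n=20000, seed=0):
    """Simulate one arm with a biased proxy and proxy-dependent audit rates.

    All numbers here are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1.0 if rng.random() < 0.6 else 0.0      # verified outcome, true mean 0.6
        f = 0.4 + 0.5 * y + rng.uniform(-0.1, 0.1)  # proxy score, mean 0.7 (biased upward)
        p = 0.1 if f > 0.6 else 0.4                 # audit propensity depends only on f
        a = rng.random() < p                        # audit indicator
        rows.append((f, y, a, p))
    return rows


def proxy_only(rows):
    """Average proxy score; inherits the judge's bias."""
    return sum(f for f, _, _, _ in rows) / len(rows)


def naive_audited_mean(rows):
    """Average verified outcome over audited rows only; ignores audit selection."""
    audited = [y for _, y, a, _ in rows if a]
    return sum(audited) / len(audited)


def ipw_residual_estimate(rows):
    """Average of f + (a / p) * (y - f); y is used only on audited rows."""
    return sum(f + ((y - f) / p if a else 0.0) for f, y, a, p in rows) / len(rows)


rows = simulate_audits()
print("proxy only:        ", round(proxy_only(rows), 3))
print("naive audited mean:", round(naive_audited_mean(rows), 3))
print("IPW residual est.: ", round(ipw_residual_estimate(rows), 3))
```

In this toy setup the proxy average sits near 0.7 and the naive audited mean is pulled well below 0.6 because audits concentrate on low proxy scores, while the residual-corrected estimate stays close to the true mean of 0.6. The correction is unbiased here because the audit probability is known and depends only on observed quantities.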

Paper Structure (75 sections, 14 theorems, 158 equations, 10 figures, 17 tables, 1 algorithm)


Key Result

Theorem 3.1

Under the boundedness assumption (ass:bounded), consider the model class where instances are drawn i.i.d. $X\sim\mathcal{D}$, and for each arm $k$ the joint distribution of $(Y(k,X),F(k,X))$ is otherwise unrestricted (allowing arm- and instance-dependent bias).
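The abstract's claim that LLM-only selection fails under arm-dependent bias can be illustrated with a two-arm toy case. This is a sketch with hypothetical numbers (arm names, true means, and judge biases are all assumptions, not values from the paper): when the judge's bias differs across arms, ranking by mean proxy score can reverse the true ranking, and no amount of additional proxy data corrects it.

```python
# Hypothetical two-arm example; none of these numbers come from the paper.
true_mean = {"A": 0.70, "B": 0.60}    # mean verified outcome per arm
judge_bias = {"A": -0.05, "B": 0.15}  # arm-dependent bias of the LLM judge

# Mean proxy score per arm is the true mean shifted by that arm's bias.
proxy_mean = {k: true_mean[k] + judge_bias[k] for k in true_mean}

best_by_truth = max(true_mean, key=true_mean.get)
best_by_proxy = max(proxy_mean, key=proxy_mean.get)

print("best arm by verified outcome:", best_by_truth)
print("best arm by proxy score:     ", best_by_proxy)
```

Here the proxy ranking selects arm B even though arm A has the higher verified mean, because the bias gap (0.20) exceeds the true gap (0.10).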

Figures (10)

  • Figure 1: Empirical coverage of PP-LUCB. Error bars show 95% bootstrap confidence intervals over 1000 trials.
  • Figure 2: Performance of the five audit allocation policies. Error bars show standard deviations over 20 trials.
  • Figure 3: Audit allocation under segment heterogeneity. Oracle-Segment achieves 21.2% cost savings by concentrating audits on high-variance segments. Discrete-Instance learns segment statistics online, achieving 7.8% savings.
  • Figure 4: Audit-accuracy trade-off in MT-Bench ($K=6$ LLM models). Judge bias limits maximum accuracy, and increasing audit rate improves but does not perfect identification within the evaluation horizon.
  • Figure 5: Identification accuracy for service configurations. (a) Exact configuration identification is limited by the tied Configurations 2 and 3. (b) Design-class accuracy (Priority + gpt-5-nano) rises from 75% to 85% as the audit rate increases from 6% to 50%.
  • ...and 5 more figures

Theorems & Definitions (21)

  • Definition 1: Proxy-only algorithms
  • Theorem 3.1
  • Proposition 1
  • Proposition 2
  • Theorem 4.1
  • Theorem 5.1
  • Theorem 5.2
  • Theorem 6.1
  • Theorem 6.2
  • Example 1: Truncated normal illustration
  • ...and 11 more