
Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce

Liang Chen, Qi Liu, Wenhuan Lin, Feng Liang

Abstract

Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory (D5: rho=0.018, n.s.) shows no detectable association. This heterogeneity causes the equal-weighted composite (rho=0.272) to underperform its best dimensions -- a composite dilution effect that conversion-informed reweighting partially corrects (rho=0.351). Logistic regression controlling for conversation length confirms D3's association strengthens (OR=3.18, p=0.006), ruling out a length confound. An initial pilot (n=14) mixing human and AI conversations had produced a misleading "evaluation-outcome paradox," which Phase 2 revealed as an agent-type confound artifact. Behavioral analysis of 130 conversations through a Trust-Funnel framework identifies a candidate mechanism: AI agents execute sales behaviors without building user trust. We operationalize these findings in a three-layer evaluation architecture and advocate criterion validity testing as standard practice in applied dialogue evaluation.
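The per-dimension analysis described above (Spearman rho between each rubric dimension's score and the verified conversion label, with Bonferroni correction across the 7 dimensions) can be sketched as follows. This is an illustrative sketch on synthetic data, not the authors' code; the array names and the 1-5 score scale are assumptions.

```python
# Illustrative sketch (synthetic data, assumed names): per-dimension
# Spearman correlation with conversion, Bonferroni-corrected.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_conv, n_dims = 60, 7                           # Phase 2: n=60, D1..D7
scores = rng.uniform(1, 5, size=(n_conv, n_dims))  # rubric scores per conversation
converted = rng.integers(0, 2, size=n_conv)        # verified conversion label (0/1)

alpha = 0.05 / n_dims                            # Bonferroni-corrected threshold
for d in range(n_dims):
    rho, p = spearmanr(scores[:, d], converted)
    flag = "significant" if p < alpha else "n.s."
    print(f"D{d + 1}: rho={rho:+.3f}, p={p:.3f} ({flag})")
```

With real data, D1 and D3 would be the dimensions clearing the corrected threshold under the paper's findings; here the synthetic scores are random, so all dimensions should come out n.s.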

Paper Structure

This paper contains 104 sections, 10 figures, 17 tables.

Figures (10)

  • Figure 1: Theoretical positioning. This work sits at the intersection of dialogue evaluation (LLM-as-Judge), conversational commerce (task success), and trust-in-automation & proxy metric theory, addressing three research gaps.
  • Figure 2: Phase 2 dimension--conversion analysis ($n = 60$). (a) Spearman $\rho$: D1 and D3 survive Bonferroni correction (**). D5 shows no detectable association. (b) Cohen's $d$: D1 and D3 show medium-to-large effects; D5 is approximately zero. Dashed line indicates medium effect threshold.
  • Figure 3: Three-layer evaluation architecture. L3 Safety (hard gate) $\to$ L2 Quality (LLM-as-Judge with conversion-informed weights) $\to$ L1 Task Success (criterion layer). D1 and D3 provide the empirically observed bridge between L2 and L1 ($\rho = 0.368$ and $\rho = 0.354$ respectively, Phase 2).
  • Figure 4: Weight scheme criterion validity comparison (Phase 2, $n = 60$). All schemes except equal weighting reach $p < 0.05$; the conversion-informed scheme achieves the highest $\rho = 0.351$ ($p = 0.006$).
  • Figure 5: Cross-phase comparison of dimension--conversion correlations. D3 is significant in both phases. D5 shifts from negative (Phase 1, confounded) to null (Phase 2, confound removed). D1 emerges as a second significant correlate with adequate power in Phase 2.
  • ...and 5 more figures
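The weight-scheme comparison in Figure 4 (equal-weighted composite vs conversion-informed reweighting) can be sketched as below. This is a toy reconstruction under stated assumptions, not the paper's method: it weights each dimension by its own positive Spearman rho with conversion, normalized to sum to 1, on synthetic data where D1 and D3 are made predictive by construction.

```python
# Sketch of composite dilution vs conversion-informed reweighting
# (synthetic data; weighting rule is an assumption, not the paper's).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
scores = rng.uniform(1, 5, size=(60, 7))         # rubric scores, D1..D7
# Make D1 and D3 predictive in this toy data, mirroring the paper's finding.
signal = scores[:, 0] + scores[:, 2]
converted = (signal + rng.normal(0, 1.5, 60) > np.median(signal)).astype(int)

# Per-dimension correlations with conversion.
rhos = np.array([spearmanr(scores[:, d], converted)[0] for d in range(7)])

w_equal = np.full(7, 1 / 7)                      # equal-weighted composite
w_conv = np.clip(rhos, 0, None)                  # keep positive correlations only
w_conv = w_conv / w_conv.sum()                   # normalize to a weight vector

for name, w in [("equal", w_equal), ("conversion-informed", w_conv)]:
    rho, p = spearmanr(scores @ w, converted)
    print(f"{name}: rho={rho:.3f}, p={p:.3f}")
```

The dilution effect arises because null dimensions (like D5 in the paper) contribute equal weight but no signal to the composite; reweighting toward the correlated dimensions recovers part of the lost validity.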