
Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation

HyunJoon Jung, William Na

Abstract

LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law; both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity: Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.

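The two scaling forms named in the abstract are both linear after a change of variables, which is how such fits are typically done. The sketch below uses made-up measurements (only the functional forms and the exponent $b \approx 0.69$ come from the paper) to show that a logarithmic score model and a sublinear power law for discoveries can each be recovered by ordinary least squares:

```python
import numpy as np

# Panel sizes; the measurements below are illustrative, not the paper's data.
N = np.array([1, 2, 4, 8, 16, 32], dtype=float)

# Scores: logarithmic growth S(N) = a + k * ln N (saturates quickly).
S = 3.0 + 0.4 * np.log(N)
# Unique discoveries: sublinear power law D(N) = c * N^b with b < 1.
D = 4.7 * N ** 0.69

# Both fits become linear regressions after a change of variables:
k, a = np.polyfit(np.log(N), S, 1)              # score is linear in ln N
b, log_c = np.polyfit(np.log(N), np.log(D), 1)  # log-log linear => power law
print(round(k, 2), round(b, 2))  # → 0.4 0.69
```

Because the power law is fit in log-log space, a slope below 1.0 directly reads off as sublinearity, which is the quantity the robustness analysis in Figure 5 tracks across deduplication thresholds.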

Paper Structure

This paper contains 34 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Overview. 32 persona-based agent judges evaluate conversational AI across 960 sessions. A Turing test confirms agent scores are indistinguishable from human scores ($p = 0.38$). The scaling analysis reveals a score--coverage dissociation: scoring reliability improves logarithmically while issue discovery follows a sublinear power law ($b \approx 0.69$). An ablation shows structured persona conditioning is necessary.
  • Figure 2: The score--coverage dissociation. Left: Scoring reliability (ICC) improves logarithmically. Right: Issue discovery follows a sublinear power law ($b = 0.69$, $R^2 = 0.999$; shaded band: $\theta = 0.60$--$0.70$). Gray diamonds show raw observation counts before semantic deduplication; the 75% gap at $N\!=\!32$ illustrates the importance of deduplication for measuring true discovery. Both dimensions exhibit diminishing returns, but scores saturate $\sim\!2\times$ faster.
  • Figure 3: Left: distribution of "matches my experience" ratings (1--5). 45% of ratings are $\geq 4$; mean = 3.07. Right: 41% of participants reported the AI found issues they missed, while only 19% found issues the AI missed---agent judges provide complementary, not redundant, coverage.
  • Figure 4: Left: Sublinear discovery scaling with head/torso/tail zones annotated. Marginal contribution per judge decreases from 4.7 (N=1--4) to 2.0 (N=16--32), consistent with diminishing returns. The confidence band spans $\theta = 0.60$--$0.70$. Right: Hypothesized power law distribution of the finding space. Critical findings (head) have high per-session discovery probability and are found by small panels; moderate findings (torso) require more diverse perspectives; corner cases (tail) are reached only by large panels. This distribution explains the sublinear exponent $b \approx 0.69$.
  • Figure 5: Power law exponent $b$ as a function of cosine similarity threshold $\theta$. Across all seven thresholds ($\theta = 0.50$--$0.80$), the exponent remains below 1.0, confirming that the sublinear conclusion is not an artifact of a particular deduplication setting. The recommended range ($\theta = 0.60$--$0.70$, shaded) is chosen by cluster quality inspection: at $\theta = 0.65$, semantically equivalent insights from different agents are correctly grouped while distinct issues remain separated.
  • ...and 2 more figures
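The semantic deduplication step referenced in Figures 2 and 5 groups findings whose embeddings exceed a cosine-similarity threshold $\theta$. A minimal sketch of one way to do this is below; the greedy one-pass clustering and the toy 2-D vectors are illustrative assumptions, not the paper's exact pipeline (which would operate on sentence embeddings of agent findings):

```python
import numpy as np

def dedupe(embeddings, theta=0.65):
    """Greedy semantic deduplication: a finding joins the first existing
    cluster whose representative has cosine similarity >= theta with it;
    otherwise it starts a new cluster. Returns the unique-finding count."""
    reps = []  # one representative (unit-normalized) embedding per cluster
    for e in embeddings:
        e = e / np.linalg.norm(e)
        if not any(float(e @ r) >= theta for r in reps):
            reps.append(e)
    return len(reps)

# Toy vectors: the first two are near-duplicates, the third is distinct.
vecs = [np.array([1.0, 0.0]), np.array([0.98, 0.2]), np.array([0.0, 1.0])]
print(dedupe(vecs))  # → 2
```

Raising $\theta$ splits near-duplicates apart (here, `dedupe(vecs, theta=0.99)` yields 3 clusters), which is why the unique-finding count, and hence the fitted exponent $b$, must be checked across a range of thresholds as in Figure 5.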