
CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala

TL;DR

CARE is a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors, and separates quality from confounders without access to ground-truth labels.

Abstract

LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. Code is released at https://github.com/SprocketLab/CARE.
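The claim that averaging gives little gain under shared confounders can be illustrated with a small simulation. This is a minimal sketch, not the CARE method: it assumes a hypothetical additive Gaussian model (score = quality + weight × confounder + independent noise), and the function names and parameter values are chosen only for illustration.

```python
import numpy as np


def simulate_judges(n_items=5000, n_judges=5, confounder_weight=0.8,
                    noise_std=0.5, seed=0):
    """Hypothetical generative model: score = Q + w * C + independent noise."""
    rng = np.random.default_rng(seed)
    quality = rng.normal(size=n_items)      # latent true quality Q
    confounder = rng.normal(size=n_items)   # shared latent confounder C (e.g. verbosity)
    noise = rng.normal(scale=noise_std, size=(n_items, n_judges))
    scores = quality[:, None] + confounder_weight * confounder[:, None] + noise
    return quality, confounder, scores


def mse_of_mean(quality, scores):
    """Mean squared error of the plain average of judge scores against Q."""
    return float(np.mean((scores.mean(axis=1) - quality) ** 2))


if __name__ == "__main__":
    for weight in (0.0, 0.8):
        quality, _, scores = simulate_judges(confounder_weight=weight)
        for m in (1, 3, 5):
            err = mse_of_mean(quality, scores[:, :m])
            print(f"confounder_weight={weight} judges={m} mse={err:.3f}")
```

Under this assumed model the expected error of the average is `confounder_weight**2 + noise_std**2 / m` for `m` judges. Adding judges shrinks only the independent-noise term, while the shared-confounder term stays fixed regardless of ensemble size.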

Paper Structure (81 sections, 9 theorems, 68 equations, 8 figures, 7 tables, 2 algorithms)


Key Result

Proposition 4.1

Assume $K_{HH}=\mathrm{diag}(d_1,\dots,d_h)$ with $d_1>\cdots>d_h>0$ and the columns of $K_{JH}$ are orthogonal. Then the columns of $K_{JH}$ (equivalently, the latent directions encoded by $L$) are identifiable from $L$ up to sign and permutation. Moreover, if $K_{JH}$ is perturbed to $\tilde{K}_{J
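The identifiability part of the statement can be checked numerically in a toy case. The sketch below rests on assumptions not given in this excerpt: it takes $L = K_{JH} K_{HH}^{-1} K_{JH}^\top$ as the form of the low-rank term, and it uses orthonormal (not merely orthogonal) columns so that distinct $d_i$ yield distinct eigenvalues. The dimensions and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_judges, n_latent = 6, 2

# Hypothetical K_JH with orthonormal columns (reduced QR of a random matrix).
K_JH, _ = np.linalg.qr(rng.normal(size=(n_judges, n_latent)))

# K_HH = diag(d_1, d_2) with d_1 > d_2 > 0, as in the proposition.
d = np.array([2.0, 1.0])

# ASSUMPTION: the low-rank term has the form K_JH K_HH^{-1} K_JH^T.
L = K_JH @ np.diag(1.0 / d) @ K_JH.T

# eigh returns eigenvalues in ascending order; the last n_latent are nonzero.
eigvals, eigvecs = np.linalg.eigh(L)
recovered = eigvecs[:, -n_latent:]

# Absolute cosines between recovered and true directions. Each row should
# contain a single entry close to 1, i.e. a match up to sign and permutation.
alignment = np.abs(recovered.T @ K_JH)
print(np.round(alignment, 6))
```

Taking the absolute value removes the sign ambiguity. Which recovered column matches which true column depends only on the eigenvalue ordering, which is the permutation ambiguity.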

Figures (8)

  • Figure 1: Graphical models for aggregating judge scores under different structural assumptions. (a) A naive model assumes scores reflect only a true latent quality ($Q$) and that all judges are equally reliable and represent independent views. (b) A connection-aware approach models intra-judge interactions ($J_2-J_3-J_4$), but still assumes the presence of a single latent quality score. (c) Our confounder-aware model introduces additional latent confounders ($C$) influencing judge scores.
  • Figure 2: Interpreting CARE-SVD latent confounders on Review-5K. Heatmap reports Spearman correlations between inferred confounder scores and response features.
  • Figure 3: Interpreting CARE-Tensor latent confounders on PKU-Safer. Bars show Spearman correlations between inferred confounder posteriors and response features.
  • Figure 4: Integration results on FeedbackQA dataset with greedy judge selection guided by validation dataset performance.
  • Figure 5: Effect of the proposed heuristic in a fully Gaussian synthetic setup. We estimate the true quality variable $Q$ and report the mean squared error. The heuristic improves estimation in the non-orthogonal setting, but slightly degrades performance in the orthogonal setting where true and confounding components are disjoint.
  • ...and 3 more figures

Theorems & Definitions (15)

  • Proposition 4.1: Identifiability and stability of latent--judge directions
  • Theorem 4.2: Finite-sample recovery for the spectral path
  • Theorem 4.3: Sample complexity for recovering $(\mu_{qc},\pi_{qc})$
  • Theorem 4.3: Exact Recovery
  • Theorem 4.4: Stability under approximate orthogonality
  • Theorem 4.5: Sample complexity for recovering $K_{JH}$
  • Theorem 4.6
  • proof
  • Corollary 4.7: Error Bound for Estimated Conditional Mean of $Q$
  • proof
  • ...and 5 more