Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment
Liang Wang, Junpeng Wang, Chin-chia Michael Yeh, Yan Zheng, Jiarui Sun, Xiran Fan, Xin Dai, Yujie Fan, Yiwei Cai
TL;DR
This paper tackles the reliability and bias of large language models used as evaluators in payments-risk tasks, focusing on MCC-based merchant risk. It introduces a domain-aligned, five-criterion rubric with Monte Carlo scoring to quantify evaluator stability and reasoning quality, plus a consensus-deviation bias metric with self-exclusion to isolate judge-specific tendencies. The study triangulates evidence from five frontier LLMs acting as judges, 26 payment-industry experts, and four years of payment-network data, revealing substantial heterogeneity in self-evaluation and cross-model bias, with some models aligning more closely with human judgments and empirical risk patterns. The framework offers a replicable way to evaluate LLM-as-a-judge systems in high-stakes financial workflows and highlights the need for bias-aware deployment practices, including calibration, ensembles, and explanation auditing. Overall, the work provides a principled methodology for assessing LLM evaluators in payment risk and underscores the value of multi-source validation for trustworthy AI-driven decision support.
Abstract
Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge's score to the mean score of all other judges, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative self-evaluation bias (-0.33, -0.31), while Gemini-2.5 Pro and Grok 4 display positive bias (+0.77, +0.71), with bias attenuating by 25.8 percent under anonymization. Evaluation by 26 payment-industry experts shows that LLM judges assign scores averaging +0.46 points above human consensus, and that the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human judgment. Ground-truth validation using payment-network data shows that four models exhibit statistically significant alignment (Spearman rho = 0.56 to 0.77), confirming that the framework captures genuine quality. Overall, the framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings.
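The consensus-deviation idea described in the abstract can be illustrated with a minimal sketch. It assumes a simple leave-one-out form: a judge's score on a rationale minus the mean of the other judges' scores on that same rationale. The function names, the nested-dictionary layout, and the toy scores are hypothetical and chosen only for illustration; the paper's exact formulation (for example, how Monte Carlo samples or rubric criteria are aggregated) may differ.

```python
from statistics import mean


def consensus_deviation(scores, judge, author):
    """Judge's score on an author's rationale minus the mean score
    given to that rationale by all OTHER judges (self-exclusion).

    scores[j][a] is the score judge j assigned to author a's rationale
    (hypothetical layout, assumed for this sketch).
    """
    others = [row[author] for name, row in scores.items() if name != judge]
    if not others:
        raise ValueError("at least two judges are required")
    return scores[judge][author] - mean(others)


def self_evaluation_bias(scores, judge):
    """Consensus deviation on the judge's own rationale."""
    return consensus_deviation(scores, judge, judge)


# Toy scores for three hypothetical models that act as both author and judge.
scores = {
    "A": {"A": 4.0, "B": 3.0, "C": 3.5},
    "B": {"A": 3.0, "B": 4.5, "C": 3.0},
    "C": {"A": 3.5, "B": 3.5, "C": 3.0},
}

for model in scores:
    print(model, self_evaluation_bias(scores, model))
```

Because the reference for each judge excludes that judge's own score, a positive value means the judge rates its own rationale above what its peers give it, and a negative value means it rates itself below the peer consensus.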
