Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment

Liang Wang, Junpeng Wang, Chin-chia Michael Yeh, Yan Zheng, Jiarui Sun, Xiran Fan, Xin Dai, Yujie Fan, Yiwei Cai

TL;DR

This paper tackles the reliability and bias of large language models used as evaluators in payments-risk tasks, focusing on MCC-based merchant risk. It introduces a domain-aligned, five-criterion rubric with Monte Carlo scoring to quantify evaluator stability and reasoning quality, plus a consensus-deviation bias metric with self-exclusion to isolate judge-specific tendencies. The study triangulates evidence from five frontier LLMs acting as judges, 26 payment-industry experts, and four years of payment-network data, revealing substantial heterogeneity in self-evaluation and cross-model bias, with some models aligning more closely with human judgments and empirical risk patterns. The framework demonstrates robust, replicable evaluation of LLM-as-a-judge systems in high-stakes financial workflows and highlights the need for bias-aware deployment practices, including calibration, ensembles, and explanation auditing. Overall, the work provides a principled methodology for assessing LLM evaluators in payment risk and underscores the value of multi-source validation for trustworthy AI-driven decision support.

Abstract

Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte-Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge's score to the mean of all other judges, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative self-evaluation bias (-0.33, -0.31), while Gemini-2.5 Pro and Grok 4 display positive bias (+0.77, +0.71), with bias attenuating by 25.8 percent under anonymization. Evaluation by 26 payment-industry experts shows LLM judges assign scores averaging +0.46 points above human consensus, and that the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human judgment. Ground-truth validation using payment-network data shows four models exhibit statistically significant alignment (Spearman rho = 0.56 to 0.77), confirming that the framework captures genuine quality. Overall, the framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings.
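
The consensus-deviation idea described in the abstract (each judge's score compared with the mean of all other judges) can be illustrated with a minimal sketch. The function name, judge labels, and scores below are hypothetical and chosen only for illustration; the paper's exact formulation may differ, for example in how scores are aggregated across rubric criteria or Monte Carlo runs.

```python
from statistics import mean


def consensus_deviation(scores: dict[str, float], judge: str) -> float:
    """Return the judge's score minus the mean of all *other* judges' scores.

    Excluding the judge's own score from the reference (self-exclusion)
    keeps the judge from influencing the consensus it is compared against.
    """
    others = [s for name, s in scores.items() if name != judge]
    if not others:
        raise ValueError("need at least two judges")
    return scores[judge] - mean(others)


# Hypothetical scores that five judges assign to the same rationale
# (illustrative values, not taken from the paper).
scores = {
    "judge_a": 8.0,
    "judge_b": 9.0,
    "judge_c": 7.5,
    "judge_d": 8.5,
    "judge_e": 9.5,
}

deviations = {judge: consensus_deviation(scores, judge) for judge in scores}
for judge, dev in deviations.items():
    print(f"{judge}: {dev:+.3f}")
```

Under this leave-one-out definition, the deviations of all judges for a single scored item sum to zero, so a positive deviation for one judge is always offset by negative deviations elsewhere. This is a property of the sketch as written here and is not claimed to be the paper's exact statement.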

Paper Structure

This paper contains 80 sections, 4 theorems, 32 equations, 10 figures, 10 tables, 1 algorithm.

Key Result

Proposition 1

For any entity $j$,

Figures (10)

  • Figure 1: Structure of the MCC Risk Rationale Prompt. Left: INPUT specifies five risk levels from very low to very high risk. Center: RATIONALE instructions require explicit coverage of five payments-risk dimensions (Business Model Stability, Regulatory Exposure, Fraud Exposure, Return Patterns, Chargeback Activity) along with 3 representative MCCs, while prohibiting numerical metrics and industry jargon. Right: OUTPUT format specifies a structured JSON array containing risk level definitions and rationales. Full prompt text appears in Appendix \ref{app:prompt}.
  • Figure 2: Representative MCCs. Ten MCCs spanning diverse risk levels and business models: from low-risk essential services (grocery stores, service stations) to high-risk categories (quasi-cash, telemarketing, dating services). Full descriptions appear in the Visa Merchant Data Standards Manual \cite{visa2023merchant} and the Mastercard Quick Reference Booklet \cite{mastercard2023merchant}.
  • Figure 3: Example LLM-Generated MCC Risk Rationales (Claude-4.5 Sonnet). Each rationale synthesizes all five risk dimensions and selects representative MCCs. Color gradients reflect increasing risk severity. Complete outputs for all models appear in Appendix \ref{app:llm_rationales}.
  • Figure 4: Monte Carlo Evaluation Framework. Top: Evaluation Context specifies the evaluator role and target models, Monte Carlo Protocol defines the 10-run sampling procedure at temperature 0.7 to quantify stability, and Critical Rules prohibit output modification and clarify that runs represent repeated evaluator judgments. Bottom: Scoring Rubric provides 0–10 scales for five criteria (Accuracy, Rationale Quality, Consistency, Completeness, Practical Applicability), Procedure outlines the six-step evaluation workflow, and Required Output specifies the structured reporting format with $\mu \pm \sigma$ scores and expert synthesis. Full prompt text appears in Appendix \ref{app:eval_prompts}.
  • Figure 5: Example: GPT-5.1 Evaluating Claude-4.5 Sonnet Under Two Conditions. (1) Attributed Condition where the source model (Claude-4.5 Sonnet) identity is disclosed, yielding a final stabilized score of $9.02 \pm 0.12$, and (2) Anonymized Condition where the same output is presented as "Expert 4" with identity concealed, yielding $8.90 \pm 0.17$. Each criterion shows the mean score $\mu \pm \sigma$ from 10 independent Monte Carlo runs at temperature 0.7, accompanied by the evaluator's justification.
  • ...and 5 more figures
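
The Monte Carlo protocol summarized in the captions for Figures 4 and 5 (10 repeated evaluator runs, reported as $\mu \pm \sigma$ per criterion) can be sketched as follows. The criterion keys, the helper name `summarize_runs`, and the synthetic scores are assumptions made for illustration. The sketch uses the sample standard deviation, which may not match the paper's choice, and it does not attempt to reproduce how the paper combines criteria into a final stabilized score.

```python
from statistics import mean, stdev

# Rubric criteria named in the Figure 4 caption; the key spellings are assumed.
CRITERIA = [
    "accuracy",
    "rationale_quality",
    "consistency",
    "completeness",
    "practical_applicability",
]


def summarize_runs(runs: list[dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Aggregate repeated evaluator runs into (mean, sample std) per criterion."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    summary = {}
    for criterion in CRITERIA:
        values = [run[criterion] for run in runs]
        summary[criterion] = (mean(values), stdev(values))
    return summary


# Synthetic stand-in for 10 evaluator runs: scores alternate between 9.0 and 9.2.
# In practice each run would be one sampled judgment from the evaluator model.
runs = [{c: 9.0 + 0.2 * (i % 2) for c in CRITERIA} for i in range(10)]

summary = summarize_runs(runs)
for criterion, (mu, sigma) in summary.items():
    print(f"{criterion}: {mu:.2f} ± {sigma:.2f}")
```

A small $\sigma$ relative to the score scale indicates that the evaluator's judgment is stable across repeated sampling, which is the quantity the 10-run protocol is intended to expose.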

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • Proposition: Zero-Sum Property
  • Proof
  • Proposition: Self-Exclusion Prevents Circularity
  • Proof