Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar

TL;DR

A counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions in three categories (Identity, Context, and Behavioral Style) finds that larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy.

Abstract

Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
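
The two metrics named in the abstract can be illustrated with a minimal sketch. It follows only the verbal definitions given above (CFR as the fraction of counterfactual pairs whose binary judgment reverses, MASD as the mean absolute score difference across pairs); the paper's exact formulas, score scales, and aggregation across dimensions are not shown here, so the `PairedOutput` record, the function names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedOutput:
    """Hypothetical record: Auto-QA outputs for an original transcript
    and one counterfactual variant of it."""
    original_judgment: bool        # binary QA judgment on the original
    counterfactual_judgment: bool  # binary QA judgment on the variant
    original_score: float          # e.g. a confidence or coaching score
    counterfactual_score: float    # same score type, on the variant


def counterfactual_flip_rate(pairs: Sequence[PairedOutput]) -> float:
    """Fraction of pairs whose binary judgment reverses (assumed CFR form)."""
    if not pairs:
        raise ValueError("need at least one counterfactual pair")
    flips = sum(p.original_judgment != p.counterfactual_judgment for p in pairs)
    return flips / len(pairs)


def mean_absolute_score_difference(pairs: Sequence[PairedOutput]) -> float:
    """Average absolute score shift across pairs (assumed MASD form)."""
    if not pairs:
        raise ValueError("need at least one counterfactual pair")
    total = sum(abs(p.original_score - p.counterfactual_score) for p in pairs)
    return total / len(pairs)


# Toy data, invented purely for illustration (not from the paper).
pairs = [
    PairedOutput(True, False, 1.0, 0.5),    # judgment flips
    PairedOutput(True, True, 0.5, 0.5),     # identical outputs
    PairedOutput(False, False, 0.75, 0.25),
    PairedOutput(True, True, 0.25, 0.5),
]

cfr = counterfactual_flip_rate(pairs)
masd = mean_absolute_score_difference(pairs)
print(f"CFR = {cfr:.2%}, MASD = {masd:.4f}")
```

On this toy input, one of four judgments flips, so the CFR is 25%, and the absolute score shifts (0.5, 0, 0.5, 0.25) average to a MASD of 0.3125. A perfectly counterfactually consistent evaluator would score 0 on both.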

Paper Structure (73 sections, 4 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Example illustrating gender-based disparity in Auto-QA responses. Conversations with identical content but different agent names (female left, male right) yield opposite judgments from the LLM.
  • Figure 2: Overview of the proposed fairness evaluation approach. (Left) Original transcripts are transformed into counterfactual variants by altering demographic, contextual, or linguistic attributes while preserving meaning. (Right) Each variant is evaluated by the target LLM to generate Auto-QA outputs, which are aggregated to compute fairness metrics and identify systematic disparities.
  • Figure 3: Tradeoff between fairness and accuracy.
  • Figure 4: Fairness–robustness contrast across bias dimensions.
  • Figure 5: Fairness–robustness contrast across models.
  • ...and 6 more figures