Evaluating Fairness in Transaction Fraud Models: Fairness Metrics, Bias Audits, and Challenges

Parameswaran Kamalaruban; Yulu Pi; Stuart Burrell; Eleanor Drage; Piotr Skalski; Jason Wong; David Sutton

TL;DR

This work addresses fairness in transaction fraud detection by performing the first algorithmic bias audit in this domain using public synthetic datasets. It presents a framework that categorizes and evaluates a wide range of group fairness metrics, including both threshold-dependent and threshold-independent measures, and examines the impact of normalization to account for severe class imbalance. Empirically, for LightGBM models trained with standard ERM and with fairness through unawareness, protection-related metrics can appear unbiased at a fixed global FP ratio, while quality-of-service metrics exhibit bias once normalized, with notable biases in high-precision regimes. The study highlights socio-technical challenges and advocates for cardholder-level fairness evaluation and nuanced metric choices that balance fraud protection with user experience, laying groundwork for future domain-specific fairness methods in transaction fraud systems.

Abstract

Ensuring fairness in transaction fraud detection models is vital due to the potential harms and legal implications of biased decision-making. Despite extensive research on algorithmic fairness, there is a notable gap in the study of bias in fraud detection models, mainly due to the field's unique challenges. These challenges include the need for fairness metrics that account for the imbalanced nature of fraud data and the tradeoff between fraud protection and service quality. To address this gap, we present a comprehensive fairness evaluation of transaction fraud models using public synthetic datasets, marking the first algorithmic bias audit in this domain. Our findings reveal three critical insights: (1) Certain fairness metrics expose significant bias only after normalization, highlighting the impact of class imbalance. (2) Bias is significant in both service quality-related parity metrics and fraud protection-related parity metrics. (3) The fairness through unawareness approach, which removes sensitive attributes such as gender, does not reduce bias within these datasets, likely due to the presence of correlated proxies. We also discuss socio-technical fairness-related challenges in transaction fraud models. These insights underscore the need for a nuanced approach to fairness in fraud detection, balancing protection and service quality, and moving beyond simple bias mitigation strategies. Future work must focus on refining fairness metrics and developing methods tailored to the unique complexities of the transaction fraud domain.
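
To make insight (1) concrete, the following is a minimal sketch of how a parity gap can look negligible in absolute terms yet large once normalized. The normalization used here (dividing the gap by the largest per-group value) is an assumption for illustration, as are the function names and the toy per-group false positive rates; the paper's exact normalization may differ. The 0.05 cutoff is the bias significance threshold quoted in the Figure 1 caption.

```python
# Illustrative only: the normalization rule and the toy numbers are assumptions,
# not the paper's definitions or results.

BIAS_THRESHOLD = 0.05  # significance threshold quoted in the Figure 1 caption


def parity_gap(values):
    """Absolute gap: largest minus smallest per-group metric value."""
    values = list(values)
    return max(values) - min(values)


def normalized_parity_gap(values):
    """Gap rescaled by the largest per-group value (assumed normalization)."""
    values = list(values)
    top = max(values)
    return 0.0 if top == 0 else (top - min(values)) / top


# Hypothetical per-group false positive rates; tiny because fraud alerts are rare.
fpr_by_group = {"group_a": 0.010, "group_b": 0.020}

raw = parity_gap(fpr_by_group.values())
norm = normalized_parity_gap(fpr_by_group.values())

print(f"raw gap        = {raw:.3f}  significant: {raw > BIAS_THRESHOLD}")
print(f"normalized gap = {norm:.3f}  significant: {norm > BIAS_THRESHOLD}")
```

In this toy case one group is flagged twice as often as the other, but the absolute gap stays well under the threshold because both rates are small; only the rescaled gap crosses it.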

Paper Structure (56 sections, 7 equations, 6 figures, 3 tables)

Introduction
Related Work
Algorithmic Fairness
Socio-Technical Fairness
Fairness in Financial AI
Algorithmic Fairness in Transaction Fraud Models
Fairness Metrics
Classifier threshold dependent metrics.
Recall Parity or Equal Opportunity (Hardt et al., 2016).
Negative Predictive Value Parity.
Precision Parity (Chouldechova, 2017).
False Positive Rate Parity (Chouldechova, 2017).
F1 Score Parity.
Equalized Odds (Hardt et al., 2016).
Demographic Parity (Dwork et al., 2012).
...and 41 more sections
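
Several of the threshold-dependent metrics named in the outline above can be computed from per-group confusion counts. The sketch below is one possible formulation under stated assumptions: predictions are already binarized at some threshold, each parity value is taken as the max-minus-min gap across groups, and all function names and toy data are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def confusion_by_group(y_true, y_pred, groups):
    """Count tp/fp/tn/fn separately for each sensitive group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1
    return dict(counts)


def safe_div(num, den):
    """Return NaN instead of raising when a group has an empty denominator."""
    return num / den if den else float("nan")


def rates(c):
    """Per-group metrics underlying the parity definitions."""
    return {
        "recall": safe_div(c["tp"], c["tp"] + c["fn"]),     # recall parity
        "fpr": safe_div(c["fp"], c["fp"] + c["tn"]),        # FPR parity
        "precision": safe_div(c["tp"], c["tp"] + c["fp"]),  # precision parity
        "npv": safe_div(c["tn"], c["tn"] + c["fn"]),        # NPV parity
    }


def parity_gaps(y_true, y_pred, groups):
    """Max-minus-min gap per metric across groups (assumed parity measure)."""
    per_group = {g: rates(c)
                 for g, c in confusion_by_group(y_true, y_pred, groups).items()}
    return {
        m: max(r[m] for r in per_group.values())
           - min(r[m] for r in per_group.values())
        for m in ("recall", "fpr", "precision", "npv")
    }


# Toy data: 1 = fraud, 0 = legitimate; two hypothetical groups "a" and "b".
y_true = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
groups = ["a"] * 6 + ["b"] * 6

gaps = parity_gaps(y_true, y_pred, groups)
print(gaps)
```

Equalized odds can then be read off as requiring both the recall gap and the FPR gap to be small. The gap computation assumes every group has at least one example in each denominator; with NaN values present, `max` and `min` are not reliable and such groups would need separate handling.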

Figures (6)

Figure 1: Transaction-level fairness evaluation of LightGBM classifier models trained on Sparkov and IBMCard datasets using standard ERM and unaware ERM approaches. Simple parity metrics are computed for a global FP ratio of 5.0. Note that AUC parity metrics are calculated independently of this fixed FP ratio. Each parity metric's normalized value is shown as an unfilled shape (circle/square) along the same vertical line. A bias significance threshold of 0.05 is chosen in accordance with existing literature (Han et al., 2023). For both datasets, significant bias was not observed for protection-related metrics with a global FP ratio of 5.0, even after normalization. However, significant bias was evident for QoS-related metrics with the same global FP ratio after normalization. Similarly, after normalization, significant bias was observed for combined protection and QoS-related metrics, except for ROC AUC parity.
Figure 2: Sparkov dataset
Figure 3: IBMCard dataset
Figure 5: Transaction-level utility evaluation of LightGBM classifier models trained on Sparkov and IBMCard datasets using standard ERM and unaware ERM approaches. Simple metrics are computed for a global FP ratio of 5.0. Note that AUC metrics are calculated independently of this fixed FP ratio.
Figure 6: Sparkov dataset
...and 1 more figures
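
The figure captions fix a global FP ratio of 5.0 before computing the threshold-dependent metrics. This excerpt does not define the term, so the sketch below assumes it means false positives per true positive among all alerts on the whole population, and picks the lowest score threshold that keeps that ratio within the target. The function name and toy scores are hypothetical.

```python
def threshold_for_fp_ratio(scores, labels, target_ratio):
    """Lowest threshold t such that alerting on score >= t keeps FP/TP <= target_ratio.

    Assumes FP ratio = false positives / true positives over all transactions.
    Returns None if no threshold satisfies the target.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best = None
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume every transaction tied at this score before evaluating the ratio.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp > 0 and fp / tp <= target_ratio:
            best = current  # lower thresholds overwrite higher ones
    return best


# Toy scores and labels (1 = fraud); not drawn from Sparkov or IBMCard.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 0, 1, 0, 0]

strict = threshold_for_fp_ratio(scores, labels, target_ratio=1.0)
loose = threshold_for_fp_ratio(scores, labels, target_ratio=5.0)
print(strict, loose)
```

Under this reading, a single threshold is chosen on the pooled data and then applied unchanged to every group, so per-group rates are free to differ; that difference is what the parity metrics measure.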
