Best Practices for Responsible Machine Learning in Credit Scoring

Giovani Valdrighi; Athyrson M. Ribeiro; Jansen S. B. Pereira; Vitoria Guardieiro; Arthur Hendricks; Décio Miranda Filho; Juan David Nieto Garcia; Felipe F. Bocca; Thalita B. Veronese; Lucas Wanner; Marcos Medeiros Raimundo

TL;DR

This paper investigates responsible machine learning for credit scoring by exploring three core areas: fairness, reject inference, and explainability. It surveys definitions and metrics for fairness, presents pre-, in-, and post-processing mitigation techniques, and experimentally compares their impact on performance and bias across multiple datasets. It also addresses sample bias via reject inference, detailing augmentation, extrapolation, and label spreading methods with an empirical evaluation. Finally, it discusses explainability approaches, including global/local explanations and counterfactuals, to enable model auditing and actionable guidance for applicants, highlighting trade-offs and practical considerations for real-world lending. Overall, the work offers a practical framework of best practices to deploy fair, transparent, and inclusive credit scoring systems while acknowledging remaining challenges and future directions.

Abstract

The widespread use of machine learning in credit scoring has brought significant advancements in risk assessment and decision-making. However, it has also raised concerns about potential biases, discrimination, and lack of transparency in these automated systems. This tutorial paper draws on a non-systematic literature review to guide best practices for developing responsible machine learning models in credit scoring, focusing on fairness, reject inference, and explainability. We discuss definitions, metrics, and techniques for mitigating biases and ensuring equitable outcomes across different groups. Additionally, we address the issue of limited data representativeness by exploring reject inference methods that incorporate information from rejected loan applications. Finally, we emphasize the importance of transparency and explainability in credit models, discussing techniques that provide insights into the decision-making process and enable individuals to understand and potentially improve their creditworthiness. By adopting these best practices, financial institutions can harness the power of machine learning while upholding ethical and responsible lending practices.

Paper Structure (31 sections, 3 equations, 4 figures, 7 tables)

Introduction
Machine Learning for Credit
Fundamental Methods
Performance Metrics
Datasets
Credit Scoring Experiments
Fairness
Definition and metrics
Group Fairness Metrics
Subgroup Fairness Metrics
Individual Fairness Metrics
Methods
Pre-processing
In-processing
Post-processing
...and 16 more sections

Figures (4)

Figure 1: Naive pipeline (a) versus a pipeline that uses reject inference (b). Credit was not approved for the red population in a previous classification iteration, so labels are available only for the yellow population. The samples of the red clients are (a) discarded from the training dataset or (b) assigned inferred labels (light green and blue). Classifier A therefore classifies the next population of clients without any knowledge of the rejected population, while Classifier B classifies it with more robust results on the rejected population.
Figure 2: Global explanations for the Logistic model and Gradient Boosting across multiple folds, computed using an explainability technique that has access to the model and using SHAP values. Displayed features are the 10 with the highest median importance across folds. Credit scores from other institutions (EXT_SOURCE) were the most important features in almost all models.
Figure 3: Partial Dependence Plot (PDP) and Individual Conditional Expectation (ICE) plot for four different models with the feature AMT_CREDIT, which is the total value of requested credit. In the ICE plot, lines are colored based on the feature AMT_INCOME_TOTAL, i.e., the client's total income.
Figure 4: SHAP and LIME local explanations for an individual who was classified with the default outcome. SHAP and LIME explanations can disagree on some occasions.
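
As a rough illustration of the two plot types named in the Figure 3 caption, the sketch below computes ICE curves and their average, the partial dependence curve, for one feature. It is a minimal sketch, not the paper's implementation: the scoring function `toy_score`, the two-feature row layout (credit amount, income), and the grid values are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from typing import Callable, List, Sequence


def ice_curves(
    predict: Callable[[Sequence[float]], float],
    rows: Sequence[Sequence[float]],
    feature_index: int,
    grid: Sequence[float],
) -> List[List[float]]:
    """Return one ICE curve per row.

    For each row, the chosen feature is replaced by every grid value in turn
    while all other features keep their original values.
    """
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature_index] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(curves: Sequence[Sequence[float]]) -> List[float]:
    """Average the ICE curves point by point to get the PDP curve."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]


def toy_score(row: Sequence[float]) -> float:
    """Hypothetical default score: grows with credit amount, shrinks with income."""
    amt_credit, amt_income = row
    return amt_credit / (amt_credit + amt_income)


# Hypothetical applicants: [credit amount, income].
rows = [[100.0, 100.0], [100.0, 300.0]]
grid = [100.0, 300.0]  # credit amounts to evaluate

curves = ice_curves(toy_score, rows, feature_index=0, grid=grid)
pdp = partial_dependence(curves)
```

Each inner list in `curves` corresponds to one applicant, which is what allows an ICE plot to color individual lines by a second feature such as income, while `pdp` collapses them into a single average curve.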
