A machine learning workflow to address credit default prediction

Rambod Rahmani; Marco Parola; Mario G. C. A. Cimino

TL;DR

This work tackles credit default prediction (CDP) in FinTech by proposing a workflow that combines Weight of Evidence (WoE) preprocessing, ensemble learning, and multi-objective hyperparameter optimization via NSGA-II to jointly optimize predictive accuracy ($AUC$) and profitability ($EMP$). The methodology spans statistical, ML, and DL models, with DL ensembles (MLP/EMLP) and ensemble voting delivering the strongest performance across four public datasets. A key contribution is balancing classification performance with financial impact through $EMP$, while optimal binning and focal loss handle data peculiarities. The approach is validated on diverse datasets and made reproducible through public code, which points to practical benefits for lenders seeking accurate, profit-aware credit risk assessment. The results indicate that integrating WoE preprocessing, ensemble strategies, and Pareto-based hyperparameter search yields robust CDP tools suitable for FinTech risk management.

Abstract

Due to the recent increase in interest in Financial Technology (FinTech), applications like credit default prediction (CDP) are gaining significant industrial and academic attention. CDP plays a crucial role in assessing the creditworthiness of individuals and businesses, enabling lenders to make informed decisions regarding loan approvals and risk management. In this paper, we propose a workflow-based approach to improve CDP, the task of assessing the probability that a borrower will default on their credit obligations. The workflow consists of multiple steps, each designed to leverage the strengths of different techniques featured in machine learning pipelines and thus best solve the CDP task. We employ a comprehensive and systematic approach that starts with data preprocessing using Weight of Evidence encoding, a technique that scales the data in a single step by removing outliers, handling missing values, and making the data uniform for models that work with different data types. Next, we train several families of learning models, introducing ensemble techniques to build more robust models and hyperparameter optimization via multi-objective genetic algorithms to account for both predictive accuracy and financial aspects. Our research aims to contribute to the FinTech industry by providing a tool for more accurate and reliable credit risk assessment, benefiting both lenders and borrowers.
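To make the Weight of Evidence step concrete, the following is a minimal sketch of WoE encoding for a single feature that has already been binned. It is not the authors' implementation. It assumes the common convention WoE = ln(share of non-defaulters in the bin / share of defaulters in the bin), treats missing values as a bin of their own, and adds a small smoothing constant to avoid taking the logarithm of zero. The names `fit_woe`, `transform_woe`, and `smoothing` are hypothetical and introduced only for illustration.

```python
import math
from collections import defaultdict


def fit_woe(values, defaults, smoothing=0.5):
    """Learn a {bin: WoE} mapping for one pre-binned feature.

    Assumed convention: WoE(bin) = ln(non-default share / default share).
    `smoothing` is an illustrative additive constant that avoids log(0).
    """
    # bin -> [number of non-defaulters, number of defaulters]
    counts = defaultdict(lambda: [0, 0])
    for value, is_default in zip(values, defaults):
        counts[value][1 if is_default else 0] += 1

    total_good = sum(c[0] for c in counts.values())
    total_bad = sum(c[1] for c in counts.values())
    n_bins = len(counts)

    woe = {}
    for bin_label, (good, bad) in counts.items():
        good_share = (good + smoothing) / (total_good + smoothing * n_bins)
        bad_share = (bad + smoothing) / (total_bad + smoothing * n_bins)
        woe[bin_label] = math.log(good_share / bad_share)
    return woe


def transform_woe(values, woe, unseen=0.0):
    """Replace each bin label by its WoE; unseen labels map to a neutral 0.0."""
    return [woe.get(value, unseen) for value in values]


if __name__ == "__main__":
    # Toy data: None stands for a missing value and forms its own bin.
    income_bin = ["low", "low", "high", "high", None, "high"]
    defaulted = [0, 0, 1, 1, 1, 0]
    mapping = fit_woe(income_bin, defaulted)
    print(transform_woe(income_bin, mapping))
```

Because every bin, including the missing-value bin, is mapped to one real number on the same log-odds scale, categorical and numerical features end up in a uniform representation, and extreme raw values are absorbed by the bin they fall into.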

Paper Structure (8 sections, 11 equations, 2 figures, 5 tables)

Introduction and background
Materials and methodology
Learning models
Data encoding
Hyperparameter optimization
Focal loss
Experiments and results
Conclusion

Figures (2)

Figure 1: Workflow design of the proposed method.
Figure 2: Scatter plot of the random forest hyperparameter optimization process.
