Interpretable Credit Default Prediction with Ensemble Learning and SHAP

Shiqi Yang, Ziyi Huang, Wengran Xiao, Xinyu Shen

TL;DR

The paper tackles credit default prediction in a high-dimensional, imbalanced real-world dataset. It systematically compares multiple classifiers, including XGBoost, LightGBM, and CatBoost, using standardized preprocessing, SMOTE balancing, and SHAP-based interpretability. SHAP analysis identifies EXT_SOURCE_3 and EXT_SOURCE_2 (external credit scores) as dominant predictors, underscoring the value of external data fusion. The results show gradient-boosting ensembles achieve superior accuracy, precision, and recall, offering a practical, interpretable framework for automated credit risk control.

Abstract

This study addresses credit default prediction by building a machine learning modeling framework and running comparative experiments across several mainstream classification algorithms. After preprocessing, feature engineering, and model training on the Home Credit dataset, multiple models, including logistic regression, random forest, XGBoost, and LightGBM, are evaluated on accuracy, precision, and recall. The results show that ensemble learning methods have a clear advantage in predictive performance and remain robust, particularly when handling complex nonlinear relationships between features and class imbalance. SHAP is also used to analyze feature importance and dependence; the external credit score variables are found to dominate model decisions, which improves the model's interpretability and practical value. The findings provide a reference and technical support for the intelligent development of credit risk control systems.
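
The abstract evaluates models on accuracy, precision, and recall under class imbalance. The sketch below shows why accuracy alone can mislead in that setting. It is an illustration only: the label counts, the "always repay" baseline, and the hypothetical classifier's predictions are invented for the example and are not results from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = default."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall for the default class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy data: 100 applicants, 8 of whom default.
y_true = [1] * 8 + [0] * 92

# Baseline that predicts "no default" for everyone.
always_repay = [0] * 100

# Hypothetical classifier: catches 6 of 8 defaults, raises 4 false alarms.
model_pred = [1] * 6 + [0] * 2 + [1] * 4 + [0] * 88

baseline_scores = evaluate(y_true, always_repay)
model_scores = evaluate(y_true, model_pred)

print("baseline:", baseline_scores)
print("model:   ", model_scores)
```

In this toy setup the trivial baseline already scores high accuracy while finding no defaulters at all, which is why precision and recall are reported alongside accuracy on imbalanced credit data.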

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overall modeling framework diagram.
  • Figure 2: Feature Importance Analysis.
  • Figure 3: SHAP Dependency Graph.
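
Figures 2 and 3 rest on SHAP attributions. As a rough illustration of the idea behind them, the sketch below computes exact Shapley values by brute force for a tiny scoring function. Everything in it is an assumption made for the example: the function `toy_risk`, its coefficients, the third feature, the input values, and the single-baseline treatment of "missing" features. It is not the paper's model, and it does not use the `shap` library, which relies on much faster tree-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline point.

    Features outside a coalition are replaced by their baseline value.
    Cost grows exponentially with the number of features, so this is
    only practical for tiny examples.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical risk score; coefficients are made up for illustration."""
    ext_source_3, ext_source_2, income = z
    return 0.6 - 0.4 * ext_source_3 - 0.3 * ext_source_2 + 0.1 * ext_source_3 * income


feature_names = ["EXT_SOURCE_3", "EXT_SOURCE_2", "income (hypothetical)"]
x = [0.2, 0.3, 1.0]          # applicant being explained (invented values)
baseline = [0.5, 0.5, 0.0]   # reference applicant (invented values)

phi = shapley_values(toy_risk, x, baseline)
for name, contribution in zip(feature_names, phi):
    print(f"{name}: {contribution:+.3f}")

# Efficiency property: contributions sum to the gap between the two scores.
gap = toy_risk(x) - toy_risk(baseline)
print("sum of contributions:", round(sum(phi), 6), "score gap:", round(gap, 6))
```

A feature importance plot of the kind in Figure 2 typically averages the absolute value of such contributions over many applicants, while a dependence plot like Figure 3 plots one feature's contribution against its value.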