Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data

Xu Sun, Zixuan Qin, Shun Zhang, Yuexian Wang, Li Huang

TL;DR

This study investigates data pre-processing techniques to enhance existing financial risk datasets by introducing TriEnhance, a straightforward technique that entails generating synthetic samples specifically tailored to the minority class, filtering using binary feedback to refine samples, and self-learning with pseudo-labels.

Abstract

In the financial risk domain, particularly in credit default prediction and fraud detection, accurate identification of high-risk class instances is paramount, as their occurrence can have significant economic implications. Although machine learning models have gained widespread adoption for risk prediction, their performance is often hindered by the scarcity and limited diversity of high-quality data. This limitation stems from dataset factors such as small risk sample sizes, high labeling costs, and severe class imbalance, which impede the models' ability to learn effectively and accurately forecast critical events. This study investigates data pre-processing techniques to enhance existing financial risk datasets by introducing TriEnhance, a straightforward technique that entails: (1) generating synthetic samples specifically tailored to the minority class, (2) filtering using binary feedback to refine samples, and (3) self-learning with pseudo-labels. Our experiments across six benchmark datasets demonstrate the efficacy of TriEnhance, with a notable focus on improving minority class calibration, a key factor for developing more robust financial risk prediction systems.
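The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the generator, the feedback model, or the pseudo-labeling rule, so everything below is an assumption made for illustration. Synthesis is approximated by linear interpolation between pairs of minority samples, the binary feedback by an accept/reject decision from a stand-in nearest-centroid scorer, and self-learning by keeping only pseudo-labels whose score clears a hypothetical confidence threshold of 0.8. All function names are hypothetical.

```python
import math
import random

def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def minority_score(x, c_major, c_minor):
    """Score in [0, 1]; higher means x lies closer to the minority centroid."""
    d_major = math.dist(x, c_major)
    d_minor = math.dist(x, c_minor)
    total = d_major + d_minor
    return 0.5 if total == 0 else d_major / total

def synthesize(minority, n_new, rng):
    """Step 1 (assumed form): interpolate between random minority pairs."""
    out = []
    for _ in range(max(0, n_new)):
        a, b = rng.sample(minority, 2)  # requires at least 2 minority rows
        t = rng.random()
        out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return out

def filter_synthetic(candidates, c_major, c_minor):
    """Step 2 (assumed form): binary accept/reject feedback per sample."""
    return [x for x in candidates if minority_score(x, c_major, c_minor) > 0.5]

def pseudo_label(unlabeled, c_major, c_minor, threshold=0.8):
    """Step 3 (assumed form): label only confidently scored points."""
    labeled = []
    for x in unlabeled:
        s = minority_score(x, c_major, c_minor)
        if s >= threshold:
            labeled.append((x, 1))
        elif s <= 1 - threshold:
            labeled.append((x, 0))
    return labeled

def tri_enhance_sketch(majority, minority, unlabeled, seed=0):
    """Chain the three steps; returns a list of (features, label) pairs."""
    rng = random.Random(seed)
    c_major, c_minor = centroid(majority), centroid(minority)
    n_new = len(majority) - len(minority)  # aim for balanced classes
    kept = filter_synthetic(synthesize(minority, n_new, rng), c_major, c_minor)
    data = [(x, 0) for x in majority] + [(x, 1) for x in minority + kept]
    data += pseudo_label(unlabeled, c_major, c_minor)
    return data

# Toy usage: 6 majority rows, 2 minority rows, 3 unlabeled rows.
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [-0.1, 0.0], [0.0, -0.2], [0.1, 0.1]]
minority = [[3.0, 3.0], [3.2, 2.9]]
unlabeled = [[0.05, 0.05], [3.1, 3.0], [1.5, 1.5]]
enhanced = tri_enhance_sketch(majority, minority, unlabeled)
```

In this toy run, the ambiguous unlabeled point near the midpoint of the two classes receives no pseudo-label, which reflects the design choice of adding only confidently labeled data. A real pipeline would likely replace the centroid scorer with a trained classifier and the interpolation with a stronger tabular generator.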

Paper Structure

This paper contains 14 sections, 17 equations, 6 figures, 2 tables, and 3 algorithms.

Figures (6)

  • Figure 1: Ideal dataset vs. real dataset. Panels (a) and (b) show datasets in the real-world environment and the ideal environment, respectively. The real-world data environment faces challenges such as insufficient sample size, class imbalance, and a large amount of unlabeled data that is not effectively utilized, while the ideal data environment is characterized by sufficient sample size and balanced classes.
  • Figure 2: Overview of the TriEnhance Architecture.
  • Figure 3: Comparison of distribution before and after data synthesis.
  • Figure 4: Comparison of distribution before and after data filtering.
  • Figure 5: Comparison of distribution of data before and after self-learning.
  • ...and 1 more figure