Neighbor displacement-based enhanced synthetic oversampling for multiclass imbalanced data

I Made Putrama, Peter Martinek

TL;DR

The paper tackles multiclass imbalanced classification by introducing Neighbor Displacement-based Enhanced Synthetic Oversampling (NDESO), a two-stage method that first displaces noisy samples toward their class centroids (NDE) and then balances the data via random oversampling (ROS). This approach aims to preserve informative structure while reducing noise and class overlap, addressing shortcomings of standard SMOTE variants. Evaluated on synthetic data and 20 real-world datasets with nine classifiers against 14 alternative methods, NDESO achieves the best average G-mean and the lowest mean rank, with statistically significant differences (e.g., $p = 6.08\times 10^{-20}$ in Friedman–Nemenyi tests). The work demonstrates practical value for imbalanced multiclass problems and provides a foundation for future scalable resampling in real-world settings.

Abstract

Imbalanced multiclass datasets pose challenges for machine learning algorithms. These datasets often contain minority classes that are important for accurate prediction. Existing methods still suffer from sparse data and may not accurately represent the original data patterns, leading to noise and poor model performance. A hybrid method called Neighbor Displacement-based Enhanced Synthetic Oversampling (NDESO) is proposed in this paper. This approach uses a displacement strategy for noisy data points, computing the average distance to their neighbors and moving them closer to their centroids. Random oversampling is then performed to achieve dataset balance. Extensive evaluations compare 14 alternatives on nine classifiers across synthetic and 20 real-world datasets with varying imbalance ratios. The results show that our method outperforms its competitors regarding average G-mean score and achieves the lowest statistical mean rank. This highlights its superiority and suitability for addressing data imbalance in practical applications.
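
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the noise rule (a point counts as noisy when most of its `k` nearest neighbors belong to other classes), and the step size (the average neighbor distance, capped at the distance to the centroid) are assumptions made for illustration only.

```python
import numpy as np

def displace_noisy_points(X, y, k=3):
    """Hypothetical displacement step: move suspected noisy points
    toward the centroid of their own class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_new = X.copy()
    # Pairwise Euclidean distances (O(n^2) memory, fine for small data).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    for i in range(len(X)):
        nn = np.argsort(dists[i])[:k]
        # Assumed noise rule: most neighbors carry a different label.
        if np.mean(y[nn] != y[i]) <= 0.5:
            continue
        direction = centroids[y[i]] - X[i]
        dist_to_centroid = np.linalg.norm(direction)
        if dist_to_centroid == 0.0:
            continue
        # Assumed step size: average neighbor distance, never overshooting.
        step = min(dists[i, nn].mean(), dist_to_centroid)
        X_new[i] = X[i] + direction / dist_to_centroid * step
    return X_new

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches
    the size of the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=target - n, replace=True)
        parts.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(parts)
    return X[idx_all], y[idx_all]

# Toy data: the last class-1 point sits inside the class-0 cluster.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8],
              [10, 10], [10, 11], [11, 10], [0.4, 0.4]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

X_d = displace_noisy_points(X, y, k=3)
X_bal, y_bal = random_oversample(X_d, y)
```

In this toy example only the misplaced class-1 point is displaced, and oversampling then brings class 1 up to the size of class 0. Displacing before oversampling means the duplicated samples are drawn from the cleaned positions rather than from the overlapping ones.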

Paper Structure

This paper contains 20 sections, 7 equations, 11 figures, 6 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Visualization of resampling on a sparse multiclass dataset: (a) original dataset; (b) noisy resampled dataset using SMOTE
  • Figure 2: Visual illustration of the CDNN algorithm (Wang2022d)
  • Figure 3: The overlapping (before) and cleaned (after) data points
  • Figure 4: Visual illustration of how displaceable data points are identified
  • Figure 5: G-mean scores of our NDE algorithm evaluated across three classifiers for varying numbers of nearest neighbors $k$ (2 to 25)
  • ...and 6 more figures
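
The G-mean reported in Figure 5 and in the abstract is not defined in this excerpt. A common multiclass definition is the geometric mean of the per-class recalls, and the sketch below assumes that definition; the function name `multiclass_gmean` is hypothetical.

```python
import numpy as np

def multiclass_gmean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Recall of class c: fraction of true-c samples predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]   # one class-1 sample is misclassified
g = multiclass_gmean(y_true, y_pred)

# A classifier that never predicts class 2 scores zero.
g_zero = multiclass_gmean(y_true, [0, 0, 1, 1, 0, 0])
```

Because the recalls are multiplied, a single ignored minority class drives the score to zero, which is why this metric is often preferred over plain accuracy on imbalanced data.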