Simplicial SMOTE: Oversampling Solution to the Imbalanced Learning Problem

Oleg Kachan, Andrey Savchenko, Gleb Gusev

TL;DR

Simplicial SMOTE addresses imbalanced learning by modeling the minority class with a neighborhood simplicial complex and sampling from its higher-dimensional $p$-simplices, using barycentric coordinates drawn from a Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$ with $\boldsymbol{\alpha}=(1,\dots,1)$. The approach generalizes SMOTE and variants such as Borderline SMOTE, Safe-level SMOTE, and ADASYN within a simplicial framework. It covers the underlying data distribution better and lets synthetic minority points lie closer to the majority class near the decision boundary. Empirical results on synthetic and real datasets show consistent gains in F1 and MCC over SMOTE and other graph-based methods, with substantial improvements on certain tasks and a modest runtime overhead that comes primarily from clique computations. These findings highlight the practical value of topological data modeling for imbalanced learning and provide guidance on choosing the hyperparameters $k$ (neighborhood size) and $p$ (maximum clique size) to balance coverage and generalization.
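To make the sampling step concrete, here is a minimal sketch of drawing points from a single simplex with $\mathrm{Dir}(1,\dots,1)$ barycentric coordinates. The function name `sample_in_simplex` and the example triangle are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_in_simplex(vertices, n_samples, rng):
    """Draw points uniformly from the simplex spanned by `vertices`.

    vertices: array of shape (p + 1, d), the corner points of a p-simplex.
    (Hypothetical helper, written for illustration only.)
    """
    vertices = np.asarray(vertices, dtype=float)
    # Dir(1, ..., 1) is uniform over barycentric coordinates:
    # non-negative weights that sum to one.
    weights = rng.dirichlet(np.ones(len(vertices)), size=n_samples)
    # Each synthetic point is a convex combination of the vertices.
    return weights @ vertices

rng = np.random.default_rng(0)
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # a 2-simplex
points = sample_in_simplex(triangle, 5, rng)
```

With two vertices ($p = 1$) the weights reduce to $(t, 1 - t)$ with $t$ uniform on $[0, 1]$, which is the edge interpolation used by SMOTE.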

Abstract

SMOTE (Synthetic Minority Oversampling Technique) is the established geometric approach to random oversampling for balancing classes in the imbalanced learning problem, and it has been followed by many extensions. Its idea is to introduce synthetic data points of the minority class, with each new point being a convex combination of an existing data point and one of its $k$ nearest neighbors. In this paper, by viewing SMOTE as sampling from the edges of a geometric neighborhood graph and borrowing tools from topological data analysis, we propose a novel technique, Simplicial SMOTE, that samples from the simplices of a geometric neighborhood simplicial complex. A new synthetic point is defined by its barycentric coordinates with respect to a simplex spanned by an arbitrary number of sufficiently close data points, rather than by a pair. This replacement of the geometric data model results in better coverage of the underlying data distribution compared to existing geometric sampling methods and allows the generation of synthetic minority-class points closer to the majority class near the decision boundary. We experimentally demonstrate that Simplicial SMOTE outperforms several popular geometric sampling methods, including the original SMOTE. Moreover, we show that simplicial sampling can be easily integrated into existing SMOTE extensions. We generalize and evaluate simplicial extensions of the classic Borderline SMOTE, Safe-level SMOTE, and ADASYN algorithms, all of which outperform their graph-based counterparts.
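The idea of sampling from simplices of a neighborhood complex can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a symmetric $k$-NN graph, enumerates $(p+1)$-cliques by brute force, and picks simplices uniformly at random. All function names are hypothetical.

```python
import itertools
import numpy as np

def knn_graph(X, k):
    """Symmetric k-NN adjacency: i~j if either is among the other's k nearest."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((len(X), len(X)), dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    adj[rows, nearest.ravel()] = True
    return adj | adj.T

def p_simplices(adj, p):
    """All (p + 1)-cliques of the graph (brute force; small inputs only)."""
    n = len(adj)
    return [c for c in itertools.combinations(range(n), p + 1)
            if all(adj[i, j] for i, j in itertools.combinations(c, 2))]

def simplicial_oversample(X_min, n_new, k=5, p=2, seed=0):
    """Sample n_new synthetic points from p-simplices of the k-NN complex."""
    rng = np.random.default_rng(seed)
    simplices = p_simplices(knn_graph(X_min, k), p)
    if not simplices:
        raise ValueError("no p-simplices found; try a larger k or smaller p")
    # Assumption: simplices are chosen uniformly at random.
    chosen = rng.integers(len(simplices), size=n_new)
    # Barycentric coordinates from Dir(1, ..., 1).
    weights = rng.dirichlet(np.ones(p + 1), size=n_new)
    verts = X_min[np.array(simplices)[chosen]]  # shape (n_new, p + 1, d)
    return np.einsum("np,npd->nd", weights, verts)

rng = np.random.default_rng(1)
X_min = rng.normal(size=(30, 2))  # toy minority class
X_new = simplicial_oversample(X_min, n_new=50, k=5, p=2)
```

Setting `p=1` restricts sampling to edges of the neighborhood graph, which roughly corresponds to the SMOTE view described above. The brute-force clique search is exponential in $p$ and is only meant for toy data.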

Paper Structure

This paper contains 18 sections, 5 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: For a configuration of three minority-class points (black circles) equidistant from a majority-class point (blue cross), b) Simplicial SMOTE generates synthetic minority-class points (red circles) closer to the majority-class point (projection distance to the 2-simplex $d_2 = 0.577$) than a) SMOTE (projection distance to any edge $d_1 = 0.707$), effectively moving the local decision boundary. c) The mean projection distance to the geometric model of the minority class decreases as the maximal relation arity parameter $p$ increases. Distance to the simplicial model is shown as solid lines for different values of the neighborhood size parameter $k$; distance to the graph model is shown as a dashed line of the same color.
  • Figure 2: Synthetic data: a) moons, b) swiss rolls, c) a Gaussian inside a sphere, d) a sphere inside a sphere.
  • Figure 3: Critical difference diagram for the $k$-NN classifier and F1 score.
  • Figure 4: Critical difference diagram for the gradient boosting classifier and F1 score.
  • Figure 5: Sensitivity to Simplicial SMOTE's hyperparameters, neighborhood size $k$ and maximum clique size $p$, with the nearest neighbor classifier as the downstream model. Performance in terms of F1 score for various $k$ and $p$ is shown as solid lines. Baseline SMOTE performance for the same $k$ is shown as a dashed line of the same color.
  • ...and 4 more figures
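The two distances quoted in the Figure 1 caption can be reproduced with a small check. The coordinates below are an assumption chosen to match the caption (three minority points at the unit vectors of $\mathbb{R}^3$, the majority point at the origin); the figure's actual coordinates are not given here.

```python
import numpy as np

# Assumed configuration: minority points at e1, e2, e3, majority at the origin,
# so all three minority points are at distance 1 from the majority point.
minority = np.eye(3)
majority = np.zeros(3)

def dist_to_affine_hull(q, vertices):
    """Distance from q to the affine hull of `vertices` via least squares.

    This equals the distance to the simplex itself only when the projection
    falls inside the simplex, which holds for this symmetric configuration.
    """
    base = vertices[0]
    directions = (vertices[1:] - base).T  # shape (d, p)
    coeffs, *_ = np.linalg.lstsq(directions, q - base, rcond=None)
    return np.linalg.norm(q - base - directions @ coeffs)

d1 = dist_to_affine_hull(majority, minority[:2])  # edge, as in SMOTE
d2 = dist_to_affine_hull(majority, minority)      # triangle, as in Simplicial SMOTE
print(f"{d1:.3f} {d2:.3f}")  # prints "0.707 0.577"
```

Under this assumption, $d_1 = 1/\sqrt{2}$ is the distance to an edge midpoint and $d_2 = 1/\sqrt{3}$ is the distance to the triangle's centroid, so the 2-simplex reaches closer to the majority point than any edge does.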