Adaptive kernel-density approach for imbalanced binary classification

Kotaro J. Nishimura, Yuichi Sakumura, Kazushi Ikeda

TL;DR

This work tackles severe class imbalance in binary classification by introducing KOTARO, a KDE-inspired classifier with density-adaptive bandwidths. For each training sample $\mathbf{x}_i$, a local scale $d_i=\max_{j\in\mathcal{N}_n(i)} d(i,j)$ is taken over the neighborhood $\mathcal{N}_n(i)$ of sample $i$, and the kernel is $k(\mathbf{x},\mathbf{x}_i)=\exp(-\gamma_i\|\mathbf{x}-\mathbf{x}_i\|^2)$ with $\gamma_i=1/d_i$. The discriminant is $f(\mathbf{x})=\sum_i w_i k(\mathbf{x},\mathbf{x}_i)$, where $\mathbf{w}=\mathbf{K}^{-1}\mathbf{y}$. Because $d_i$ is small where samples are dense and large where they are sparse, the kernels are narrow in high-density (majority) regions and wide in sparse (minority) regions, which sharpens boundaries around the majority class and extends coverage of the minority class under extreme imbalance. The method is validated on synthetic extreme-imbalance (EI) and divergent-imbalance (DI) datasets and on real-world imbalanced medical data, with Boruta feature selection used to assess robustness to noisy features. Results show superior performance under severe imbalance, especially for EI-type distributions, and suggest a two-phase strategy: use KOTARO on raw data, then switch to ensemble or re-sampled methods after feature curation. The work highlights practical impact for critical domains such as medical diagnosis, where minority-class recognition is essential, and outlines future directions for automatic imbalance-type identification and broader distribution testing.
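The formulas in the TL;DR can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. It assumes that $\mathcal{N}_n(i)$ is the set of the $n$ nearest neighbors of sample $i$ under Euclidean distance, that labels are encoded as $-1$ (majority) and $+1$ (minority), that the decision threshold is zero, and that no two training points coincide (otherwise $d_i=0$). The function names `fit_kotaro_sketch` and `decision_function`, the parameter `n_neighbors`, and the small `ridge` term added for numerical stability are all hypothetical.

```python
import numpy as np


def fit_kotaro_sketch(X, y, n_neighbors=3, ridge=1e-8):
    """Fit weights w = K^{-1} y with one bandwidth per training sample.

    Assumptions (not taken from the paper): Euclidean distance,
    labels in {-1, +1}, distinct training points, and a tiny ridge term.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    if not 0 < n_neighbors < n_samples:
        raise ValueError("n_neighbors must be in [1, n_samples - 1]")

    # Pairwise squared and plain Euclidean distances between training points.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    dist = np.sqrt(sq_dist)

    # d_i: distance from sample i to the farthest of its n nearest neighbors.
    # After sorting each row, column 0 is the zero self-distance.
    d = np.sort(dist, axis=1)[:, n_neighbors]
    gamma = 1.0 / d  # gamma_i = 1 / d_i

    # K[j, i] = k(x_j, x_i) = exp(-gamma_i * ||x_j - x_i||^2).
    # K is generally not symmetric because gamma depends on i only.
    K = np.exp(-gamma[None, :] * sq_dist)

    # Solve (K + ridge * I) w = y instead of forming the inverse explicitly.
    w = np.linalg.solve(K + ridge * np.eye(n_samples), y)
    return X, gamma, w


def decision_function(model, X_new):
    """Evaluate f(x) = sum_i w_i * exp(-gamma_i * ||x - x_i||^2)."""
    X_train, gamma, w = model
    X_new = np.asarray(X_new, dtype=float)
    sq_dist = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma[None, :] * sq_dist) @ w


# Toy 9:1 example: nine majority points on a line, one distant minority point.
X = np.array([[float(v)] for v in range(9)] + [[20.0]])
y = np.array([-1.0] * 9 + [1.0])
model = fit_kotaro_sketch(X, y, n_neighbors=2)

# Query one point inside the majority cluster and one near the minority sample.
scores = decision_function(model, np.array([[4.0], [19.0]]))
labels = np.sign(scores)
```

In this toy setup the isolated minority point receives a much smaller $\gamma_i$ than the tightly packed majority points, so its kernel decays slowly and still dominates at the nearby query, which is the behavior the TL;DR describes. Solving the linear system rather than inverting $\mathbf{K}$ is a numerical convenience chosen for this sketch.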

Abstract

Class imbalance is a common challenge in real-world binary classification tasks, often leading to predictions biased toward the majority class and reduced recognition of the minority class. This issue is particularly critical in domains such as medical diagnosis and anomaly detection, where correct classification of minority classes is essential. Conventional methods often fail to deliver satisfactory performance when the imbalance ratio is extremely severe. To address this challenge, we propose a novel approach called Kernel-density-Oriented Threshold Adjustment with Regional Optimization (KOTARO), which extends the framework of kernel density estimation (KDE) by adaptively adjusting decision boundaries according to local sample density. In KOTARO, the bandwidth of Gaussian basis functions is dynamically tuned based on the estimated density around each sample, thereby enhancing the classifier's ability to capture minority regions. We validated the effectiveness of KOTARO through experiments on both synthetic and real-world imbalanced datasets. The results demonstrated that KOTARO outperformed conventional methods, particularly under conditions of severe imbalance, highlighting its potential as a promising solution for a wide range of imbalanced classification problems.

Paper Structure

This paper contains 16 sections, 10 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Imbalanced samples and algorithms. (a) Data structure distortion due to oversampling. (b) Information loss due to undersampling. (c) Limitations of boundary estimation through learning in regions without samples. (d) The proposed method uses a kernel dependent on local sample density, represented in one-dimensional space for clarity.
  • Figure 2: Workflow of the KOTARO method.
  • Figure 3: Conceptual diagram of artificial imbalance data for evaluating the performance of binary classifiers. (a) Extreme imbalance: The majority label samples are concentrated in small areas. (b) Divergent imbalance: The overall density is not skewed, but the minority samples are limited to small areas.
  • Figure 4: Comparison of classification boundaries for the EI dataset in two dimensions. Blue points indicate the majority class, and red points indicate the minority class. The total number of samples is 100, with a majority-to-minority ratio of 9:1.
  • Figure 6: Binary classification accuracy for imbalanced high-dimensional synthetic datasets. The upper and lower rows correspond to EI-type and DI-type imbalance, respectively, and from left to right the graphs show results for three-, six-, and nine-dimensional samples. The vertical axis denotes classification accuracy, while the horizontal axis indicates the imbalance ratio ($M_i / M_a$), where smaller values correspond to stronger imbalance. Because the test set was balanced (50 positive + 50 negative samples), always predicting the positive class yields 50% accuracy as a baseline. Each curve shows the mean accuracy with standard errors computed over 20 independent trials.
  • ...and 1 more figure