Local distribution-based adaptive oversampling for imbalanced regression

Shayan Alahyari, Mike Domaratzki

TL;DR

Imbalanced regression suffers from sparse regions in the continuous target distribution, especially for rare values. The authors introduce LDAO, a data-level oversampling method that learns local joint distributions by clustering in the joint feature-target space, estimating local densities with Gaussian KDE, and oversampling within each cluster before merging. This approach avoids arbitrary rarity thresholds and preserves the intrinsic statistical structure, achieving superior performance on 45 diverse datasets compared with state-of-the-art methods, as measured by RMSE, MAE, and SERA. The results demonstrate robust improvement in both frequent and rare target regions, offering a practical, data-driven solution for imbalanced regression with broad applicability across domains.

Abstract

Imbalanced regression occurs when continuous target variables have skewed distributions, creating sparse regions that are difficult for machine learning models to predict accurately. This issue particularly affects neural networks, which often struggle with imbalanced data. While class imbalance in classification has been extensively studied, imbalanced regression remains relatively unexplored, with few effective solutions. Existing approaches often rely on arbitrary thresholds to categorize samples as rare or frequent, ignoring the continuous nature of target distributions. These methods can produce synthetic samples that fail to improve model performance and may discard valuable information through undersampling. To address these limitations, we propose LDAO (Local Distribution-based Adaptive Oversampling), a novel data-level approach that avoids categorizing individual samples as rare or frequent. Instead, LDAO learns the global distribution structure by decomposing the dataset into a mixture of local distributions, each preserving its statistical characteristics. LDAO then models and samples from each local distribution independently before merging them into a balanced training set. LDAO achieves a balanced representation across the entire target range while preserving the inherent statistical structure within each local distribution. In extensive evaluations on 45 imbalanced datasets, LDAO outperforms state-of-the-art oversampling methods on both frequent and rare target values, demonstrating its effectiveness for addressing the challenge of imbalanced regression.
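
The pipeline described above (cluster in the joint feature-target space, model each cluster with a Gaussian KDE, oversample each cluster, then merge) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`kmeans`, `sample_gaussian_kde`, `ldao_like_oversample`), the number of clusters `k`, the Scott's-rule bandwidth, the standardization step, and the choice to grow every cluster to the size of the largest one are all assumptions made for illustration; the paper may select these differently.

```python
import numpy as np


def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of Z; returns integer cluster labels."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = Z[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def sample_gaussian_kde(points, n_new, rng):
    """Draw n_new samples from a Gaussian KDE fitted to `points`.

    Sampling from a Gaussian KDE amounts to picking a data point uniformly
    at random and adding Gaussian noise with the kernel covariance.
    The bandwidth uses Scott's rule (an assumption of this sketch).
    """
    n, d = points.shape
    base = points[rng.integers(0, n, size=n_new)]
    if n < 2:
        # Too few points to estimate a covariance: fall back to duplication.
        return base
    h = n ** (-1.0 / (d + 4))  # Scott's bandwidth factor
    cov = np.atleast_2d(np.cov(points, rowvar=False)) * h ** 2
    cov += 1e-9 * np.eye(d)  # small ridge for numerical stability
    noise = rng.multivariate_normal(np.zeros(d), cov, size=n_new)
    return base + noise


def ldao_like_oversample(X, y, k=3, seed=0):
    """Cluster (X, y) jointly, oversample each cluster via KDE, then merge."""
    rng = np.random.default_rng(seed)
    Z = np.column_stack([X, y]).astype(float)  # joint feature-target space

    # Standardize so that no single column dominates the distances.
    mu, sigma = Z.mean(axis=0), Z.std(axis=0)
    sigma[sigma == 0] = 1.0
    Zs = (Z - mu) / sigma

    labels = kmeans(Zs, k, seed=seed)
    sizes = np.bincount(labels, minlength=k)
    target = sizes.max()  # assumption: grow each cluster to the largest size

    parts = [Zs]  # original samples are kept unchanged
    for j in range(k):
        n_new = target - sizes[j]
        if sizes[j] == 0 or n_new == 0:
            continue
        parts.append(sample_gaussian_kde(Zs[labels == j], n_new, rng))

    Z_bal = np.vstack(parts) * sigma + mu  # undo the standardization
    return Z_bal[:, :-1], Z_bal[:, -1]
```

Calling `ldao_like_oversample(X, y, k=3)` on a feature matrix `X` and a skewed target `y` returns the original rows followed by synthetic rows. Because targets are sampled jointly with the features, no rarity threshold on `y` is needed, which is the property the abstract emphasizes.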

Paper Structure

This paper contains 21 sections, 19 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The left image shows a classification problem with easily identifiable minority (red) and majority (green) classes. The right image illustrates an imbalanced regression problem where target values in the sparse region (red regions) are underrepresented, making them more difficult to detect and accurately predict.
  • Figure 2: LDAO process overview. First, the imbalanced dataset (top) is decomposed into clusters. Then, each cluster is oversampled individually using kernel density estimation (middle). Lastly, these balanced clusters are combined into one dataset (bottom).
  • Figure 3: $k$‑means clustering in the joint feature–target space for the Boston dataset (Harrison and Rubinfeld, 1978), projected onto the first three principal components. Data points are grouped into three clusters (each indicated by a unique marker and color). Ellipsoidal density contours characterize the spread and orientation of each cluster. The centroids, marked with prominent gold symbols, represent the mean positions of the data points within their respective clusters.
  • Figure 4: Clusters obtained via $k$‑means on the feature–target data are projected onto two principal components. For each cluster, kernel density estimation (KDE) is performed and the overlaid contour lines represent the density gradients of the data in the reduced space.
  • Figure 5: Each cluster is independently oversampled based on its own density estimate, ensuring that both sparse and dense areas are adequately represented without altering their local characteristics; the same procedure is applied to the remaining clusters, with cluster 1 displayed here.
  • ...and 4 more figures