
Histogram Approaches for Imbalanced Data Streams Regression

Ehsan Aminian, Rita P. Ribeiro, Joao Gama

TL;DR

This work tackles imbalanced regression in data streams by introducing histogram-based sampling strategies that dynamically detect rare regions across the target distribution. Leveraging Partitional Incremental Discretization (PiD) online histograms, the methods HistUS (undersampling) and HistOS (oversampling) adjust training focus toward rare values, without assuming rarity only at distribution tails. Empirical results across synthetic and real-world benchmarks show HistUS and HistOS improve rare-case predictive accuracy (RMSE$_{\phi}$, SERA) and outperform prior Chebyshev-based streaming approaches, albeit with trade-offs in overall RMSE. The approach offers a practical, non-parametric way to handle evolving imbalanced regression in real-time applications and opens avenues for extensions to concept drift and multi-target problems.

Abstract

Imbalanced domains pose a significant challenge in real-world predictive analytics, particularly in regression. While existing research has primarily focused on batch learning from static datasets, limited attention has been given to imbalanced regression in online learning scenarios. To address this gap, in prior work we proposed sampling strategies based on Chebyshev's inequality as the first methodologies designed explicitly for data streams. However, these approaches operated under the restrictive assumption that rare instances reside exclusively at the extremes of the distribution. This study introduces histogram-based sampling strategies to overcome this constraint, proposing flexible solutions for imbalanced regression in evolving data streams. The proposed techniques -- Histogram-based Undersampling (HistUS) and Histogram-based Oversampling (HistOS) -- employ incremental online histograms to dynamically detect and prioritize rare instances across arbitrary regions of the target distribution, improving predictions for rare cases. Comprehensive experiments on synthetic and real-world benchmarks demonstrate that HistUS and HistOS substantially improve rare-case prediction accuracy, outperforming baseline models while maintaining competitiveness with Chebyshev-based approaches.
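
To make the idea of histogram-based undersampling concrete, the following is a minimal Python sketch, not the authors' HistUS algorithm. It rests on several assumptions that do not come from the paper: a fixed-range, equal-width histogram stands in for the PiD online histograms, the class `StreamingHistogram`, the function `undersample_stream`, and the keep-probability rule (mean occupancy of non-empty bins divided by the bin's own count) are all hypothetical, and the paper's parameters $\alpha$ and $\beta$ are not modelled.

```python
import random


class StreamingHistogram:
    """Equal-width online histogram over a fixed target range.

    Illustrative stand-in only; it is not PiD.
    """

    def __init__(self, low, high, n_bins=20):
        self.low = low
        self.high = high
        self.n_bins = n_bins
        self.counts = [0] * n_bins

    def bin_index(self, y):
        # Clamp out-of-range targets into the first or last bin.
        width = (self.high - self.low) / self.n_bins
        idx = int((y - self.low) / width)
        return max(0, min(self.n_bins - 1, idx))

    def update(self, y):
        self.counts[self.bin_index(y)] += 1

    def keep_probability(self, y):
        # Hypothetical rule: targets in bins at or below the mean occupancy
        # of the non-empty bins are always kept; targets in denser bins are
        # kept with probability mean_count / bin_count.
        count = self.counts[self.bin_index(y)]
        non_empty = [c for c in self.counts if c > 0]
        if count == 0 or not non_empty:
            return 1.0
        mean_count = sum(non_empty) / len(non_empty)
        return min(1.0, mean_count / count)


def undersample_stream(stream, hist, rng):
    """Yield the (x, y) pairs that survive histogram-based undersampling."""
    for x, y in stream:
        hist.update(y)  # the histogram sees every instance
        if rng.random() < hist.keep_probability(y):
            yield x, y  # only kept instances would reach the learner


# Toy stream: 95 common targets near 0.5 and 5 rare targets near 0.05.
rng = random.Random(0)
stream = [(None, 0.5)] * 95 + [(None, 0.05)] * 5
rng.shuffle(stream)

hist = StreamingHistogram(0.0, 1.0, n_bins=10)
kept = list(undersample_stream(stream, hist, rng))
```

Because the rule depends only on bin occupancy, a sparsely populated bin in the middle of the target range is treated the same way as one at a tail, which is the property the abstract contrasts with the Chebyshev-based strategies. An oversampling variant would instead present instances from sparse bins to the learner more than once.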

Paper Structure

This paper contains 25 sections, 6 equations, 17 figures, 7 tables, 2 algorithms.

Figures (17)

  • Figure 1: Synthetic dataset (a), histogram of target values (b) and respective relevance values (c) of common and rare cases considering $thr_\phi$ of 0.9.
  • Figure 2: Comparison of Probability and $K$ values between the Chebyshev-based and the histogram-based approaches for the synthetic dataset.
  • Figure 3: True target values and respective predictions using the Chebyshev-based sampling and histogram-based sampling for the synthetic dataset. The orange-highlighted areas in the background correspond to the rare regions in the target value domain.
  • Figure 4: Influence of different values of the sampling decay parameter $\beta$ on the probability values obtained over the synthetic dataset.
  • Figure 5: Effect of different values of the reduction coefficient parameter $\alpha$ on the oversampling rate. The parameter $\beta$ is fixed at $4$, while $\alpha$ is varied to analyze its impact on oversampling frequency across target values. The orange-highlighted areas in the background correspond to the rare regions in the target value domain.
  • ...and 12 more figures