Imbalance in Regression Datasets

Daniel Kowatsch, Nicolas M. Müller, Kilian Tscharke, Philip Sperl, Konstantin Bötinger

TL;DR

This work argues that imbalance in regression is a real analogue of classification imbalance, driven by over- and under-representation in the target distribution and causing regressors to neglect rare targets. It develops a generalized target balance framework based on relevance measures μ and probability distributions P_Y, and proposes Kolmogorov and Wasserstein metrics to quantify imbalance as d(μ, P_Y). The authors validate the approach with synthetic and real-world data, showing that standard MAE can miss deterioration in rare-target regions while per-bin or weighted metrics reveal the bias, and that Wasserstein distance correlates strongly with performance measures. Overall, the paper provides a theoretical and empirical foundation for measuring and addressing regression imbalance, paving the way for robust training strategies and mitigation methods.

Abstract

For classification, the problem of class imbalance is well known and has been extensively studied. In this paper, we argue that imbalance in regression is an equally important problem which has so far been overlooked: Due to under- and over-representations in a data set's target distribution, regressors are prone to degenerate to naive models, systematically neglecting uncommon training data and over-representing targets seen often during training. We analyse this problem theoretically and use resulting insights to develop a first definition of imbalance in regression, which we show to be a generalisation of the commonly employed imbalance measure in classification. With this, we hope to turn the spotlight on the overlooked problem of imbalance in regression and to provide common ground for future research.
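The TL;DR describes imbalance as a distance d(μ, P_Y) between a relevance measure μ and the target distribution P_Y, measured with the Kolmogorov or Wasserstein metric. The sketch below illustrates that idea under assumptions that are not taken from the paper: μ is taken to be the uniform distribution on a hypothetical interval `[low, high]`, P_Y is replaced by the empirical distribution of the sampled targets, and the Wasserstein-1 distance is approximated by comparing sorted samples with uniform quantiles at midpoints. The function names are illustrative only.

```python
def kolmogorov_distance_to_uniform(targets, low, high):
    """Sup-norm distance between the empirical CDF of `targets`
    and the CDF of Uniform[low, high] (assumed relevance measure)."""
    ys = sorted(targets)
    n = len(ys)
    worst = 0.0
    for i, y in enumerate(ys, start=1):
        # CDF of the uniform reference, clipped to [0, 1]
        g = min(max((y - low) / (high - low), 0.0), 1.0)
        # The empirical CDF jumps from (i - 1) / n to i / n at y
        worst = max(worst, i / n - g, g - (i - 1) / n)
    return worst


def wasserstein_distance_to_uniform(targets, low, high):
    """Approximate Wasserstein-1 distance to Uniform[low, high]:
    mean absolute gap between the sorted samples and the uniform
    quantiles evaluated at the midpoints (i + 0.5) / n."""
    ys = sorted(targets)
    n = len(ys)
    total = 0.0
    for i, y in enumerate(ys):
        quantile = low + (high - low) * (i + 0.5) / n
        total += abs(y - quantile)
    return total / n


# Hypothetical data: evenly spread targets vs. targets crowded near 0
balanced = [(i + 0.5) / 100 for i in range(100)]
skewed = ([0.1 * (i + 0.5) / 95 for i in range(95)]
          + [0.9 + 0.1 * (i + 0.5) / 5 for i in range(5)])

for name, data in [("balanced", balanced), ("skewed", skewed)]:
    k = kolmogorov_distance_to_uniform(data, 0.0, 1.0)
    w = wasserstein_distance_to_uniform(data, 0.0, 1.0)
    print(name, round(k, 3), round(w, 3))
```

Both distances are close to zero for the evenly spread sample and grow as the targets concentrate in one region, which is the behaviour an imbalance score of this kind is meant to capture. Other relevance measures would need a different reference CDF and quantile function.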

Paper Structure

This paper contains 25 sections, 24 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: A histogram of the target variable of the Abalone training dataset (https://archive.ics.uci.edu/ml/datasets/abalone; red bars). The blue line shows the mean absolute test error per histogram bin of a neural network trained on this data. Note the heavy imbalance in the target variable: because of it, rare values (i.e. targets $\geq 15$) are poorly predicted, and the model degenerates to predicting only targets in the interval $[5, 10]$.
  • Figure 2: The impact of imbalance in classification data.
  • Figure 3: The impact of imbalance in regression data.
  • Figure 4: Histogram plots of three synthetic data sets in which the continuous target variable is bimodally distributed. For each plot, a neural network regressor is trained on an equally imbalanced training set and evaluated on a test set (red bars). The regressor's predicted targets are shown as a KDE plot (blue line). The data set in the top plot has an imbalance factor of $3$, the middle one of $10$, and the bottom one of $20$. For higher degrees of imbalance, we observe the same phenomenon as in classification: the regressor fails to capture the minority-mode data and degenerates to a naive model.
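
The per-bin error curve in Figure 1 can be illustrated with a small sketch. The example below is not the authors' code: the data, the bin edges, and the helper names `overall_mae` and `per_bin_mae` are hypothetical, and the "model" is simply a constant predictor standing in for a regressor that has degenerated to the majority region.

```python
from bisect import bisect_right


def overall_mae(y_true, y_pred):
    """Mean absolute error over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def per_bin_mae(y_true, y_pred, edges):
    """Mean absolute error per target bin [edges[k], edges[k + 1]).
    Returns one value per bin, or None for bins with no samples."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(y_true, y_pred):
        # Locate the bin of the true target, clamping values that
        # fall outside the edges into the first or last bin
        k = min(max(bisect_right(edges, t) - 1, 0), n_bins - 1)
        sums[k] += abs(t - p)
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical imbalanced targets: 90 common values, 10 rare values
y_true = [8.0] * 90 + [18.0] * 10
# A naive model that always predicts the common value
y_pred = [8.0] * 100

total = overall_mae(y_true, y_pred)
bins = per_bin_mae(y_true, y_pred, edges=[0.0, 15.0, 30.0])
print(total, bins)
```

In this toy setting the overall MAE stays small because the rare bin contributes only 10 of the 100 samples, while the per-bin view exposes that every rare target is missed by a wide margin.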

Theorems & Definitions (2)

  • Definition 1
  • Definition 2