Optimal Downsampling for Imbalanced Classification with Generalized Linear Models

Yan Chen, Jose Blanchet, Krzysztof Dembczynski, Laura Fee Nern, Aaron Flores

TL;DR

This work develops a theoretically grounded approach to downsampling for imbalanced binary classification within generalized linear models. It introduces a pseudo maximum likelihood estimator that can be computed directly from downsampled data, along with an asymptotic normality theory in a rare-event regime where the minority probability vanishes as the sample size grows. A budget-constrained efficiency framework yields an explicit optimal downsampling rate $\alpha^*$, balancing statistical precision and computational cost, with detailed results specialized to logistic regression. Empirical experiments on synthetic and real datasets demonstrate that the proposed pseudo MLE often outperforms existing downsampling estimators, particularly in highly imbalanced settings, and the framework generalizes to neural network contexts via GLM-inspired insights.

Abstract

Downsampling or under-sampling is a technique that is utilized in the context of large and highly imbalanced classification models. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We propose a pseudo maximum likelihood estimator and study its asymptotic normality in the context of increasingly imbalanced populations relative to an increasingly large sample size. We provide theoretical guarantees for the introduced estimator. Additionally, we compute the optimal downsampling rate using a criterion that balances statistical accuracy and computational efficiency. Our numerical experiments, conducted on both synthetic and empirical data, further validate our theoretical results, and demonstrate that the introduced estimator outperforms commonly available alternatives.

Paper Structure

This paper contains 27 sections, 14 theorems, 104 equations, 7 figures, 1 table.

Key Result

Proposition 1

The prediction score of the downsampled random variables satisfies $\mathbb{P}(\tilde{Y}_i=1\mid\tilde{X}_i=x)=\frac{\mathbb{P}(Y_i=1\mid X_i=x)}{\alpha+(1-\alpha)\mathbb{P}(Y_i=1\mid X_i=x)}.$ Hence $\{(\tilde{Y}_i,\tilde{X}_i)\}_{i=1}^N$ still follows a GLM, with the conditional probability $\mathbb{P}(\tilde{Y}_i=1\mid\tilde{X}_i)$ [...] is the c.d.f. of some random variable.
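The ratio in Proposition 1 can be checked numerically. The sketch below is illustrative only and rests on an assumption that is not stated in the text above: every positive ($Y=1$) is kept and each negative ($Y=0$) is kept independently with probability $\alpha$. The function names `downsampled_score`, `original_score`, and `simulated_positive_rate` are hypothetical helpers, not part of the paper.

```python
import numpy as np

# Assumption (not stated in the text above): every positive (Y=1) is kept and
# each negative (Y=0) is kept independently with probability alpha.

def downsampled_score(p, alpha):
    """Map the original score P(Y=1|X=x) to the downsampled score."""
    p = np.asarray(p, dtype=float)
    return p / (alpha + (1.0 - alpha) * p)

def original_score(q, alpha):
    """Invert the map: recover P(Y=1|X=x) from a downsampled score q."""
    q = np.asarray(q, dtype=float)
    return alpha * q / (1.0 - (1.0 - alpha) * q)

def simulated_positive_rate(p, alpha, n, seed=0):
    """Empirical share of positives after downsampling n draws at a fixed x."""
    rng = np.random.default_rng(seed)
    y = rng.random(n) < p                  # labels with P(Y=1) = p
    keep = y | (rng.random(n) < alpha)     # keep positives, thin negatives
    return float(y[keep].mean())

p, alpha = 0.01, 0.1
q = downsampled_score(p, alpha)            # 0.01 / (0.1 + 0.9 * 0.01)
p_back = original_score(q, alpha)          # algebraic inverse, recovers p
q_hat = simulated_positive_rate(p, alpha, n=500_000)
```

Under this assumed scheme, `q_hat` should land close to `q`, and `original_score` shows how a score fitted on downsampled data could be mapped back to the original scale.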

Figures (7)

  • Figure 1: Estimation error for different $\alpha$.
  • Figure 2: The left panel of each figure plots the MSE of the inverse-weighting estimator (blue), the pseudo MLE (green), and the conditional MLE (red) under logistic regression, for $\alpha$ chosen around $\mathbb{P}(Y=1)$ and $\tau_n=10.0, 9.8, 6.0, 5.0$. The blue, green, and red dashed lines are the $95\%$ confidence intervals for the squared losses of the inverse-weighting estimator, pseudo MLE, and conditional MLE, respectively. The upper and lower ends are computed as $\pm 1.96\,\hat{\sigma}/\sqrt{500}$, where $\hat{\sigma}$ is the standard deviation of the squared losses at each $\alpha$ over $500$ random environments. In the right panel of each figure, the green solid line is the average squared-loss difference between the inverse-weighting estimator and the pseudo MLE, and the purple line is that of the conditional MLE minus the pseudo MLE; the dashed lines are the $95\%$ confidence intervals.
  • Figure 3: Additional results for abalone_19 dataset for small and moderate values of $\tau_n$.
  • Figure 4: Additional results for yeast_me2 dataset for small and moderate values of $\tau_n$.
  • Figure 5: Mean-squared-error and Efficiency costs
  • ...and 2 more figures
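
The confidence bands described in the Figure 2 caption (mean squared loss $\pm 1.96\,\hat{\sigma}/\sqrt{500}$) can be reproduced along these lines. This is a minimal sketch: `mean_with_ci` is a hypothetical helper, the use of the sample standard deviation (`ddof=1`) is an assumption since the caption does not say which convention was used, and the five-value input is a toy array rather than data from the paper.

```python
import numpy as np

def mean_with_ci(squared_losses, z=1.96):
    """Return (mean, lower, upper) for a normal-approximation interval."""
    losses = np.asarray(squared_losses, dtype=float)
    n = losses.size                        # 500 random environments in the caption
    sigma_hat = losses.std(ddof=1)         # assumption: sample standard deviation
    half_width = z * sigma_hat / np.sqrt(n)
    m = losses.mean()
    return m, m - half_width, m + half_width

# Toy input for illustration only; not data from the paper.
demo = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, lo, hi = mean_with_ci(demo)
```

In the setting of the caption, the function would be applied once per value of $\alpha$ to the 500 squared losses of each estimator.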

Theorems & Definitions (30)

  • Proposition 1: Downsample Prediction Score
  • Proposition 2: Downsample Joint Likelihood of $(\tilde{Y},\tilde{X})$
  • Remark 1
  • Theorem 1: Asymptotic Normality of MLE as $\tau_n\rightarrow\infty$
  • Remark 2
  • Remark 3
  • Theorem 2: Generalized Scaled Asymptotic Normality
  • Theorem 3: Optimal Downsampling Rate for Imbalanced Classification
  • Proposition 3: Asymptotic Normality for Logistic Regression
  • Remark 4
  • ...and 20 more