Beyond Imbalance Ratio: Data Characteristics as Critical Moderators of Oversampling Method Selection

Yuwen Jiang, Songyun Ye

Abstract

The prevailing IR-threshold paradigm posits a positive correlation between imbalance ratio (IR) and oversampling effectiveness, yet this assumption has not been tested through controlled experimentation. We conducted 12 controlled experiments (N > 100 dataset variants) that systematically manipulated IR while holding data characteristics (class separability, cluster structure) constant via algorithmic generation of Gaussian mixture datasets. Two additional validation experiments examined ceiling effects and metric-dependence. All methods were evaluated on 17 real-world datasets from OpenML. After controlling for confounding variables, IR exhibited a weak to moderate negative correlation with oversampling benefits. Class separability emerged as a substantially stronger moderator, accounting for significantly more variance in method effectiveness than IR alone. We propose a 'Context Matters' framework that integrates IR, class separability, and cluster structure to provide evidence-based selection criteria for practitioners.
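
To make the controlled-manipulation idea concrete, the following is a minimal sketch of how one might generate a two-class Gaussian dataset in which IR and class separability are set independently. It is not the authors' generator: the function name `make_imbalanced_gaussians`, its parameters, and the choice of unit-variance classes separated along a single axis are all assumptions made for illustration.

```python
import random


def make_imbalanced_gaussians(n_majority, imbalance_ratio, separation, dim=2, seed=0):
    """Hypothetical generator: two unit-variance Gaussian classes.

    imbalance_ratio = n_majority / n_minority (assumed definition of IR).
    separation      = distance between class means along axis 0.
    """
    rng = random.Random(seed)
    n_minority = max(1, round(n_majority / imbalance_ratio))

    X, y = [], []
    # Majority class (label 0) is centred at the origin and drawn first,
    # so it is identical across IR settings that share the same seed.
    for _ in range(n_majority):
        X.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
        y.append(0)
    # Minority class (label 1) has its mean shifted by `separation` on axis 0.
    for _ in range(n_minority):
        point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        point[0] += separation
        X.append(point)
        y.append(1)
    return X, y


# Vary IR while keeping separability (and the majority sample) fixed.
X_ir10, y_ir10 = make_imbalanced_gaussians(200, imbalance_ratio=10, separation=2.0)
X_ir50, y_ir50 = make_imbalanced_gaussians(200, imbalance_ratio=50, separation=2.0)
```

Because only the minority count changes between the two calls, any difference in downstream oversampling benefit can be attributed to IR rather than to a shift in separability, which is the confound the controlled design is meant to remove.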

Paper Structure

This paper contains 50 sections, 3 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Theoretical framework: Data characteristics as moderators of the IR-oversampling effectiveness relationship. Solid arrows indicate main effects; dashed arrows indicate moderation effects.
  • Figure 2: Algorithmic workflow of the 12 controlled experiments. The three-stage design progresses from observational replication (Stage 1) through controlled manipulation (Stage 2) to moderation testing (Stage 3). B1: Within-dataset IR manipulation; B2: Generated data with controlled parameters; C1-C3: Moderation experiments for separability, cluster structure, and sample size.
  • Figure 3: Separability moderates oversampling effectiveness. Low separability data benefits more from oversampling.
  • Figure 4: Validation experiments demonstrating metric-dependence and ceiling effects in the IR-oversampling effectiveness relationship. Top row: Ceiling effect control showing absolute vs. relative improvement correlations. Bottom row: Multi-metric validation displaying effect sizes across eight evaluation metrics. Error bars represent 95% confidence intervals.
  • Figure 5: Comparison of SMOTE and BorderlineSMOTE: SMOTE generates samples through linear interpolation (creating "bridge" structures), while BorderlineSMOTE focuses only on boundary regions.
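
The linear interpolation described in the Figure 5 caption can be sketched as follows. This is a simplified, standard-library illustration of the basic SMOTE step, not the implementation evaluated in the paper or the one in any particular library; the function name `smote_sample` and its parameters are hypothetical.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Simplified SMOTE sketch: interpolate between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the base.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (Euclidean), excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # New point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority_points, k=3, n_new=8)
```

Every synthetic point falls on a line segment joining two existing minority points, which is what produces the "bridge" structures when the minority class consists of separate clusters. BorderlineSMOTE differs in that it restricts the base points to minority samples near the class boundary; that selection step is omitted from this sketch.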