Close to Reality: Interpretable and Feasible Data Augmentation for Imbalanced Learning

Matheus Camilo da Silva, Gabriel Gustavo Costanzo, Andrea de Lorenzo, Sylvio Barbon Junior

Abstract

Many machine learning classification tasks involve imbalanced datasets, which are often subject to over-sampling techniques aimed at improving model performance. However, these techniques are prone to generating unrealistic or infeasible samples. Furthermore, they often function as black boxes, lacking interpretability in their procedures. This opacity makes it difficult to track their effectiveness and make necessary adjustments, and they may ultimately fail to yield significant performance improvements. To bridge this gap, we introduce the Decision Predicate Graphs for Data Augmentation (DPG-da), a framework that extracts interpretable decision predicates from trained models to capture domain rules and enforce them during sample generation. This design ensures that over-sampled data remain diverse, constraint-satisfying, and interpretable. In experiments on synthetic and real-world benchmark datasets, DPG-da consistently improves classification performance over traditional over-sampling methods, while guaranteeing logical validity and offering clear, interpretable explanations of the over-sampled data.

Paper Structure (32 sections, 4 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: DPG structure showing predicates as nodes and class leaves.
  • Figure 2: Workflow of the proposed DPG-da. Constraints extracted from DPGs guide the optimization for valid and diverse synthetic samples.
  • Figure 3: Heatmap of normalized constraint violation rates per over-sampling method and dataset. Violation rate is computed as the number of violations divided by the number of synthesized samples, averaged across repeated runs.
  • Figure 4: Mean classification performance across augmentation methods and sampling percentages. Red background areas indicate methods that violated constraints (DE, SMOTE-LVQ, SMOTE-POLYNOM, and SMOTE-SVM). The dashed line indicates the average performance of the classifiers on the original datasets without augmentation.
  • Figure 5: Critical difference diagram for F1-score performance. Methods not connected by a bar differ significantly (Nemenyi test, $\alpha=0.05$, CD = 1.167).
  • ...and 10 more figures