
Geometric Manifold Rectification for Imbalanced Learning

Xubin Wang, Qing Li, Weijia Jia

TL;DR

Geometric Manifold Rectification (GMR) tackles imbalanced classification on structured data by integrating local geometric priors into a preprocessing framework. It introduces inverse-distance weighted kNN-based geometric confidence estimation and an adaptive asymmetric cleaning policy that aggressively removes intrusive majority samples while conservatively protecting minority samples, supported by theoretical results on posterior shifting and variance reduction. Empirically, GMR demonstrates classifier-agnostic improvements across 27 datasets and 7 learners, and provides benefits to deep tabular models and even fixed-feature image tasks, indicating broad applicability. The approach offers a practical, model-agnostic data rectification step that complements loss-based and synthesis-based strategies, with clear pathways to extension to multi-class settings and scalable nearest-neighbor implementations.

Abstract

Imbalanced classification presents a formidable challenge in machine learning, particularly when tabular datasets are plagued by noise and overlapping class boundaries. From a geometric perspective, the core difficulty lies in the topological intrusion of the majority class into the minority manifold, which obscures the true decision boundary. Traditional undersampling techniques, such as Edited Nearest Neighbours (ENN), typically employ symmetric cleaning rules and uniform voting, failing to capture the local manifold structure and often inadvertently removing informative minority samples. In this paper, we propose GMR (Geometric Manifold Rectification), a novel framework designed to robustly handle imbalanced structured data by exploiting local geometric priors. GMR makes two contributions: (1) geometric confidence estimation that uses inverse-distance weighted kNN voting with an adaptive distance metric to capture local reliability; and (2) asymmetric cleaning that is strict on majority samples while conservatively protecting minority samples via a safeguarding cap on minority removal. Extensive experiments on multiple benchmark datasets show that GMR is competitive with strong sampling baselines.
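
The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a plain Euclidean metric in place of the adaptive metric, and the function names, the thresholds `tau_major` and `tau_minor`, and the cap `max_minor_frac` are hypothetical choices made only for illustration.

```python
import numpy as np

def geometric_confidence(X, y, k=5, eps=1e-12):
    """Inverse-distance weighted k-NN agreement of each sample with its own label.

    Assumption: Euclidean distance (the paper's adaptive metric is not reproduced).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise distances; O(n^2) memory, acceptable for a small illustration.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]      # indices of the k nearest neighbours
    nn_dist = np.take_along_axis(dist, nn, axis=1)
    w = 1.0 / (nn_dist + eps)                 # closer neighbours weigh more
    agree = (y[nn] == y[:, None]).astype(float)
    return (w * agree).sum(axis=1) / w.sum(axis=1)

def asymmetric_clean(X, y, minority_label=1, k=5,
                     tau_major=0.5, tau_minor=0.1, max_minor_frac=0.05):
    """Strict on majority samples, conservative on minority samples.

    All thresholds are hypothetical defaults, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    conf = geometric_confidence(X, y, k=k)
    is_min = y == minority_label
    # Majority samples are dropped whenever their confidence is low.
    drop = (~is_min) & (conf < tau_major)
    # Minority samples need a much lower confidence, and removals are capped.
    cand = np.flatnonzero(is_min & (conf < tau_minor))
    cap = int(np.floor(max_minor_frac * is_min.sum()))
    cand = cand[np.argsort(conf[cand])][:cap]  # lowest confidence first
    drop[cand] = True
    keep = ~drop
    return X[keep], y[keep], keep

# Toy data: a majority cluster, a minority cluster, and one intrusive majority point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),   # majority cluster
               rng.normal(5.0, 0.3, (6, 2)),    # minority cluster
               [[5.0, 5.0]]])                   # majority point inside the minority region
y = np.array([0] * 20 + [1] * 6 + [0])
X_clean, y_clean, keep = asymmetric_clean(X, y)
```

On this toy data the intrusive majority point has only minority neighbours, so its confidence is zero and it is removed, while the cap (here rounding down to zero removals for six minority samples) leaves the minority class untouched.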

Paper Structure

This paper contains 37 sections, 6 theorems, 27 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Lemma 3.2

Let $\mathcal{D}'$ be obtained from $\mathcal{D}$ by removing a subset $R \subset \mathcal{D}$. The empirical posterior on the cleaned data satisfies: where $p_1(x)=\hat{P}_{\mathcal{D}}(Y=1|x)$, $p_0(x)=\hat{P}_{\mathcal{D}}(Y=0|x)$, and $r_c = |R \cap \mathcal{D}_c|/|\mathcal{D}_c|$ is the (class-conditional) removal rate. If $r_0 > r_1$ (asymmetric cleaning that removes relatively more majorit
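
The displayed formula is missing from this extract. As a hedged illustration only, and not a restatement of the lemma, the quantities defined above combine as follows under the added assumption that removal thins each class uniformly near $x$, so that the local class-$c$ count is scaled by $(1 - r_c)$:

```latex
\hat{P}_{\mathcal{D}'}(Y=1 \mid x)
  = \frac{(1 - r_1)\, p_1(x)}{(1 - r_1)\, p_1(x) + (1 - r_0)\, p_0(x)} .

% Worked example with hypothetical values p_1(x) = 0.3, p_0(x) = 0.7, r_0 = 0.4, r_1 = 0:
\hat{P}_{\mathcal{D}'}(Y=1 \mid x)
  = \frac{1 \cdot 0.3}{1 \cdot 0.3 + 0.6 \cdot 0.7}
  = \frac{0.3}{0.72} \approx 0.417 > 0.3 = p_1(x).
```

Under this assumption, whenever $r_0 > r_1$ and $0 < p_1(x) < 1$ the denominator shrinks faster than the numerator, so the minority posterior on the cleaned data is larger than $p_1(x)$.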

Figures (5)

  • Figure 1: Illustration of the core challenges in imbalanced classification. The majority class (blue circles) significantly outnumbers the minority class (red squares). The dashed box highlights the overlap region where samples from both classes intermingle, creating ambiguity. Black dots mark intrusive majority samples that have penetrated the minority manifold—these ambiguous boundary samples degrade classifier performance and are the primary targets of GMR's geometric cleaning strategy.
  • Figure 2: Visual illustration of the GMR framework. Top row: (a) Original — an imbalanced dataset where the majority class (blue circles) intrudes into the minority manifold (red squares); the dashed box marks the overlap region containing ambiguous samples (black dots). (b) Weighting — GMR computes geometric confidence via distance-weighted $k$-NN (formula shown in panel); arrow thickness denotes neighbor influence (thicker = closer = higher weight). (c) Cleaned — Asymmetric cleaning removes low-confidence majority samples (crossed gray circles) while conservatively protecting minority samples, producing a smoother decision boundary shifted away from the minority manifold. Bottom legend: node types and arrow thickness ($\propto$ weight).
  • Figure 3: Deep tabular baseline comparison on large-scale datasets (AUPRC; mean over 5 seeds). Bars show TabDDPM (raw) vs. TabDDPM+GMR (pre-cleaned).
  • Figure 4: Abbreviations used for resampling methods in tables.
  • Figure 5: Heatmap of average ranks across 7 classifiers (averaged over 27 datasets). Colors map low ranks (better) to green and high ranks (worse) to red.
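
The average-rank statistic behind a heatmap such as Figure 5 can be sketched as below for a single classifier. The method names and scores are hypothetical placeholders, not results from the paper, and ties are not handled in this simplified version.

```python
import numpy as np

# Hypothetical toy scores (NOT from the paper): rows = datasets, columns = methods.
methods = ["A", "B", "C"]
scores = np.array([
    [0.70, 0.74, 0.78],
    [0.61, 0.66, 0.64],
    [0.82, 0.80, 0.85],
])

# Rank methods within each dataset: 1 = best (highest score). Ties are not handled.
order = np.argsort(-scores, axis=1)
ranks = np.empty_like(order)
rows = np.arange(scores.shape[0])[:, None]
ranks[rows, order] = np.arange(1, scores.shape[1] + 1)

# Average each method's rank over datasets; lower is better.
avg_rank = ranks.mean(axis=0)
```

Repeating this per classifier and stacking the resulting vectors would give one heatmap row per classifier.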

Theorems & Definitions (12)

  • Definition 3.1: Asymmetric Risk
  • Lemma 3.2: Data Cleaning Effect on Posteriors
  • Proof
  • Theorem 3.3: Geometric Weighting (Variance-Reduction Condition)
  • Proof (Sketch)
  • Corollary 3.4: Adaptive Metric Selection
  • Proposition 3.5: Asymmetric Boundary Alignment
  • Proof (Sketch)
  • Theorem B.1: Restatement
  • Proof
  • ...and 2 more