HyperCore: Coreset Selection under Noise via Hypersphere Models

Brian B. Moser, Arundhati S. Shanbhag, Tobias C. Nauen, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel

TL;DR

HyperCore presents a robust coreset selection framework that explicitly handles annotation noise by learning class-wise hypersphere embeddings with a fixed center $\mathbf{c}=\mathbf{0}$ and applying adaptive pruning via Youden's $J$ statistic. This yields per-class conformity scores based on embedding norms and enables automatic, noise-aware subset selection without tuning global hyperparameters. Empirical results on ImageNet-1K and CIFAR-10 show HyperCore delivers strong performance under noisy and low-data regimes, outperforming or matching state-of-the-art baselines while maintaining near-linear, embarrassingly parallel computation. The work offers a practical, scalable approach to data pruning that improves robustness and training efficiency in real-world noisy-label contexts.

Abstract

The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden's J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
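
The adaptive thresholding step can be illustrated with a minimal sketch. It assumes each sample of a class-specific model has already been reduced to a scalar distance (the embedding norm, since the center is the origin) and that a sample is accepted as in-class when its distance is at most the threshold. The function name `youden_threshold` and its arguments are hypothetical, and the brute-force scan over candidate thresholds is chosen for readability rather than speed; it is not the authors' implementation.

```python
import numpy as np


def youden_threshold(distances, is_in_class):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    Assumption: a sample counts as accepted (predicted in-class) when its
    distance to the hypersphere center is <= threshold.
    """
    distances = np.asarray(distances, dtype=float)
    is_in_class = np.asarray(is_in_class, dtype=bool)
    n_pos = is_in_class.sum()
    n_neg = (~is_in_class).sum()
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both in-class and out-of-class samples")

    best_t, best_j = None, -np.inf
    # Candidate thresholds are the observed distances, in ascending order.
    for t in np.unique(distances):
        accepted = distances <= t
        tpr = (accepted & is_in_class).sum() / n_pos   # in-class samples accepted
        fpr = (accepted & ~is_in_class).sum() / n_neg  # out-of-class samples accepted
        j = tpr - fpr
        if j > best_j:  # strict comparison keeps the smallest maximizer
            best_t, best_j = t, j
    return float(best_t), float(best_j)


# Toy data: four in-class samples (one far from the center) and two out-of-class.
dist = np.array([0.1, 0.2, 0.3, 0.9, 1.2, 1.5])
in_class = np.array([True, True, True, False, True, False])

threshold, j_value = youden_threshold(dist, in_class)
# Keep only in-class samples inside the selected radius; the rest are pruned.
kept = in_class & (dist <= threshold)
```

In this toy example the in-class sample at distance 1.2 lies beyond the selected radius and is pruned, which mirrors how a far-away sample carrying the class label would be treated as a likely annotation error.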

Paper Structure

This paper contains 22 sections, 2 theorems, 12 equations, 4 figures, 4 tables.

Key Result

Lemma: Balanced Sampling Prevents Trivial Collapse

Assume that each training batch is balanced, i.e. the number of in-class ($y=0$) samples equals the number of out-of-class ($y=1$) samples. Let $W_0$ be the all-zero weight configuration such that $\phi(\mathbf{x};W_0)=\mathbf{0}$ for all $\mathbf{x}$. Then the HyperCore loss at $W_0$ is unbounded, which rules out the trivial solution as optimal.
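
The loss definition is not reproduced in this summary, so the following is only a sketch of why the collapse is penalized. It assumes the per-sample loss has the standard hypersphere-classifier form with a squared-norm distance; this form is an assumption made for illustration and should be checked against the paper's HyperCore Loss definition.

$$
\ell(\mathbf{x}, y; W) = (1-y)\,\|\phi(\mathbf{x};W)\|^2 \;-\; y \,\log\!\left(1 - \exp\!\left(-\|\phi(\mathbf{x};W)\|^2\right)\right)
$$

Under this assumption, setting $\phi(\mathbf{x};W_0)=\mathbf{0}$ gives, for an in-class sample ($y=0$), $\ell = 0$, and for an out-of-class sample ($y=1$),

$$
\ell(\mathbf{x}, 1; W_0) = -\log\!\left(1 - e^{0}\right) = -\log(0) = +\infty .
$$

A balanced batch always contains out-of-class samples, so the batch loss at $W_0$ contains divergent terms and cannot be a minimum.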

Figures (4)

  • Figure 1: Left: Visualization of HyperCore. In-class samples are pulled toward the center, while out-of-class samples are pushed away, creating a clear separation. Right: Illustration of adaptive pruning ratio selection via Youden’s $J$ statistic. Two candidate thresholds are compared; the purple threshold yields the higher $J$ value and is therefore preferred for pruning.
  • Figure 2: Runtime measurements on CIFAR-10. Even with training time included, HyperCore ranks among the fastest techniques, averaging only 4 minutes per class and benefiting from a parallelizable design.
  • Figure 3: Average hypersphere radii (adaptive thresholds) and their standard deviations as a function of the relabeling percentage. The plot reveals that both the mean radius and its variability increase with higher levels of label poisoning, reflecting a broader dispersion in the embedding space and an adaptive expansion of the decision boundary to accommodate noise.
  • Figure 4: Left: Confusion-based metrics (TPR, FPR, TNR, FNR) under increasing label poisoning in CIFAR-10. Right: Youden’s $J$ (orange) and fraction of removed samples (blue). Both plots highlight HyperCore’s robust coreset selection behavior across varying degrees of poisoned labels (error bands show the variance across classes).

Theorems & Definitions (6)

  • Definition: Coreset Selection
  • Definition: Hypersphere Classifier
  • Definition: HyperCore Loss
  • Lemma: Balanced Sampling Prevents Trivial Collapse
  • Proof
  • Lemma: Threshold Search Complexity