PARIS: Pruning Algorithm via the Representer theorem for Imbalanced Scenarios

Enrico Camporeale

TL;DR

This work tackles imbalanced regression by pruning training data through a representer-theorem-based framework that quantifies each point's impact on validation loss without retraining. It derives a closed-form deletion residual and uses Cholesky rank-one updates to perform fast, greedy pruning, maintaining performance on tail events while drastically reducing dataset size. The method is validated on space-weather Dst forecasting, showing up to 75% data reduction with preserved or improved tail performance and competitive overall RMSE. PARIS also provides an interpretable mechanism for understanding sample influence, with potential extensions to multi-output settings and streaming data.

Abstract

The challenge of \textbf{imbalanced regression} arises when standard Empirical Risk Minimization (ERM) biases models toward high-frequency regions of the data distribution, causing severe degradation on rare but high-impact ``tail'' events. Existing strategies, such as loss re-weighting or synthetic over-sampling, often introduce noise, distort the underlying distribution, or add substantial algorithmic complexity. We introduce \textbf{PARIS} (Pruning Algorithm via the Representer theorem for Imbalanced Scenarios), a principled framework that mitigates imbalance by \emph{optimizing the training set itself}. PARIS leverages the representer theorem for neural networks to compute a \textbf{closed-form representer deletion residual}, which quantifies the exact change in validation loss caused by removing a single training point \emph{without retraining}. Combined with an efficient Cholesky rank-one downdating scheme, PARIS performs fast, iterative pruning that eliminates uninformative or performance-degrading samples. We evaluate PARIS on a real-world space weather example, where it reduces the training set by up to 75\% while preserving or improving overall RMSE, outperforming re-weighting, synthetic oversampling, and boosting baselines. Our results demonstrate that representer-guided dataset pruning is a powerful, interpretable, and computationally efficient approach to rare-event regression.
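
To make the "Cholesky rank-one downdating" idea concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation: it assumes a ridge regression on a fixed feature matrix (the names `Phi`, `lam`, and `chol_downdate` are hypothetical), where deleting training point $i$ subtracts $\phi_i \phi_i^\top$ from the regularized Gram matrix. The existing Cholesky factor can then be downdated, rather than recomputed, to obtain the weights of the model trained without that point.

```python
import numpy as np

def chol_downdate(L, x):
    """Return lower-triangular L_new with L_new @ L_new.T == L @ L.T - outer(x, x).

    Standard rank-one downdate; assumes the downdated matrix stays positive definite.
    """
    L = L.copy()
    x = np.asarray(x, dtype=float).copy()
    n = x.size
    for k in range(n):
        r2 = L[k, k] ** 2 - x[k] ** 2
        if r2 <= 0.0:
            raise np.linalg.LinAlgError("downdated matrix is not positive definite")
        r = np.sqrt(r2)
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        # Update the rest of column k, then propagate the change into x.
        L[k + 1:, k] = (L[k + 1:, k] - s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

# Hypothetical toy problem: 50 training points, 5 features, ridge penalty lam.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 1e-2

A = Phi.T @ Phi + lam * np.eye(5)   # regularized Gram matrix
L = np.linalg.cholesky(A)           # factor once, up front

i = 7                               # index of the training point to delete
L_minus = chol_downdate(L, Phi[i])  # factor of A - phi_i phi_i^T
b_minus = Phi.T @ y - Phi[i] * y[i]
w_minus = np.linalg.solve(L_minus.T, np.linalg.solve(L_minus, b_minus))

# Reference: refit from scratch without point i.
Phi_ref = np.delete(Phi, i, axis=0)
y_ref = np.delete(y, i)
w_ref = np.linalg.solve(Phi_ref.T @ Phi_ref + lam * np.eye(5), Phi_ref.T @ y_ref)
```

In this sketch the downdate costs $O(d^2)$ per deleted point for $d$ features, versus $O(d^3)$ for a fresh factorization, which is the kind of saving that makes greedy, point-by-point pruning practical.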

Paper Structure

This paper contains 21 sections, 29 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Cumulative Distribution Function of the Dst data used for the imbalanced regression evaluation.
  • Figure 2: Conditional RMSE (cRMSE) vs. Storm Intensity Threshold $T$. cRMSE is calculated for all samples where the ground truth $D_{st} \leq T$. Lower values of $T$ (more negative $D_{st}$) represent the most severe storm events.
  • Figure 3: Absolute error for the 10 strongest storm values across the whole dataset. The vertical axis is linear but cut off at 150 nT (XGB and SMOGN exceed this value in all cases). For event 3 (-413.0 nT), the absolute residual of PARIS is smaller than 1 nT.