Fast and Scalable Cellwise-Robust Ensembles for High-Dimensional Data

Anthony Christidis, Jeyshinee Pyneeandee, Gabriela Cohen-Freue

Abstract

The analysis of high-dimensional data, common in fields such as genomics, is complicated by the presence of cellwise contamination, where individual cells rather than entire rows are corrupted. This contamination poses a significant challenge to standard variable selection techniques. While recent ensemble methods have introduced deterministic frameworks that partition the predictor space to manage high collinearity, these architectures were not designed to handle cellwise contamination, leaving a critical methodological gap. To bridge this gap, we propose the Fast and Scalable Cellwise-Robust Ensemble (FSCRE) algorithm, a framework that integrates three statistical stages. First, the algorithm establishes a robust foundation by deriving a cleaned data matrix and a reliable, cellwise-robust covariance structure. Variable selection then proceeds via a competitive ensemble: a robust, correlation-based formulation of the Least-Angle Regression (LARS) algorithm proposes candidates for multiple sub-models, and a cross-validation criterion arbitrates their final assignment. Despite its architectural complexity, the proposed method enjoys fundamental theoretical guarantees, including invariance properties and local selection stability. Through extensive simulations and a bioinformatics application, we demonstrate FSCRE's superior performance in variable selection precision, recall, and predictive accuracy across various contamination scenarios. This work provides a unified framework connecting cellwise-robust estimation with high-performance ensemble learning, with an implementation available on CRAN.
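
To illustrate why cellwise contamination is especially troublesome in high dimensions, the following sketch simulates it. It is not taken from the paper: the function name `contaminate_cellwise`, the fixed outlier shift, and the values of `n`, `p`, and `eps` are assumptions chosen for illustration. If each cell is corrupted independently with probability `eps`, a row with `p` cells contains at least one corrupted cell with probability `1 - (1 - eps)**p`, so even a small per-cell rate leaves very few fully clean rows.

```python
import numpy as np


def contaminate_cellwise(X, eps, rng, shift=10.0):
    """Corrupt each cell independently with probability eps.

    The outlier mechanism (adding a fixed shift) is a hypothetical
    choice for illustration, not the paper's contamination model.
    """
    mask = rng.random(X.shape) < eps      # True where a cell is corrupted
    X_cont = X.copy()
    X_cont[mask] += shift
    return X_cont, mask


rng = np.random.default_rng(0)
n, p, eps = 100, 500, 0.01                # illustrative sizes only
X = rng.standard_normal((n, p))
X_cont, mask = contaminate_cellwise(X, eps, rng)

cell_fraction = mask.mean()               # share of corrupted cells
row_fraction = mask.any(axis=1).mean()    # share of rows with >= 1 corrupted cell
expected_row_fraction = 1.0 - (1.0 - eps) ** p

print(f"corrupted cells: {cell_fraction:.3f}")
print(f"affected rows (simulated): {row_fraction:.3f}")
print(f"affected rows (expected):  {expected_row_fraction:.3f}")
```

With roughly 1% of cells corrupted, nearly every row is affected in this setting, which is why methods that discard or downweight entire rows have little clean data left to work with.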

Paper Structure

This paper contains 27 sections, 6 theorems, 29 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

The set of selected predictor indices returned by the FSCRE algorithm is invariant to per-column affine transformations of the observed data matrix $[\mathbf{y}, \mathbf{X}]$.
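
The intuition behind this kind of invariance can be sketched numerically. The example below is a simplified stand-in, not the FSCRE implementation: it ranks predictors by absolute Pearson correlation with the response, whereas FSCRE relies on cellwise-robust estimates, and the helper `correlation_ranking` is a hypothetical name. It also assumes that every scale factor in the affine map is nonzero. Under these assumptions, shifting and rescaling each column changes correlations at most in sign, so a ranking based on their absolute values is unchanged.

```python
import numpy as np


def correlation_ranking(X, y):
    """Rank predictors by absolute Pearson correlation with y, largest first.

    Hypothetical helper for illustration; it uses plain (non-robust)
    correlations rather than the cellwise-robust estimates in the paper.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr), kind="stable"), corr


rng = np.random.default_rng(1)
n, p = 60, 8
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)

# Per-column affine map x_j -> a_j * x_j + b_j with a_j != 0 (assumption),
# and a separate affine map for the response.
a = rng.choice([-1.0, 1.0], size=p) * rng.uniform(0.5, 5.0, size=p)
b = rng.normal(scale=10.0, size=p)
X_t = X * a + b
y_t = -3.0 * y + 7.0

rank_orig, corr_orig = correlation_ranking(X, y)
rank_t, corr_t = correlation_ranking(X_t, y_t)

print("ranking unchanged:", np.array_equal(rank_orig, rank_t))
print("max |corr| difference:", np.max(np.abs(np.abs(corr_orig) - np.abs(corr_t))))
```

This only demonstrates the property for a plain correlation screen; the proposition itself concerns the full algorithm and is established in the paper.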

Figures (5)

  • Figure 1: MSPE across 50 splits for the Mixture Correlation scenario. Performance is evaluated across three SNRs and three sparsity levels. Because the error is scaled by the noise variance, the optimal possible MSPE is 1.0.
  • Figure 2: Recall (top row) and precision (bottom row) across 50 splits for the Mixture Correlation scenario, comparing the DDC-EN pipeline with the proposed FSCRE algorithm.
  • Figure 3: Median computational execution time (in seconds) for the proposed FSCRE algorithm and the baseline DDC-EN pipeline, plotted on logarithmic axes. Left: Time as a function of the number of predictors $p$, with sample size fixed at $n=100$. Right: Time as a function of sample size $n$, with predictors fixed at $p=1{,}000$.
  • Figure 4: MSPE across 50 random splits for the prediction of ER-$\alpha$ protein abundance. The models were evaluated on both the original TCGA data (light grey) and data subjected to targeted artificial cellwise contamination in the training set (dark grey).
  • Figure 5: Median Mean Squared Prediction Error (MSPE), Recall, and Precision of the FSCRE algorithm as a function of the number of sub-models ($K$). Results are shown for the Mixture Correlation scenario (SNR $= 1.0$) across three sparsity levels (50, 100, and 200 active predictors) over 50 replications.
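
The remark in the Figure 1 caption that the optimal scaled MSPE is 1.0 follows from a standard decomposition. As a sketch, assume a new response $y = f(\mathbf{x}) + \varepsilon$, where the noise $\varepsilon$ has mean zero and variance $\sigma^2$ and is independent of the prediction $\hat{y}$; this notation is an assumption for illustration and may differ from the paper's. The cross term vanishes, giving

$$
\frac{\mathbb{E}\left[(y - \hat{y})^2\right]}{\sigma^2}
= \frac{\sigma^2 + \mathbb{E}\left[(f(\mathbf{x}) - \hat{y})^2\right]}{\sigma^2}
= 1 + \frac{\mathbb{E}\left[(f(\mathbf{x}) - \hat{y})^2\right]}{\sigma^2}
\ge 1,
$$

with equality only when $\hat{y} = f(\mathbf{x})$ almost surely. The bound holds in expectation, so an empirical MSPE computed on a finite test split can fall slightly below 1.0 by chance.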

Theorems & Definitions (6)

  • Proposition 1: General Affine Invariance
  • Proposition 2: Permutation Equivariance
  • Proposition 3: Intercept Invariance
  • Proposition 4: Computational Complexity
  • Proposition 5: Local Selection Stability
  • Lemma 1