Fast and Scalable Cellwise-Robust Ensembles for High-Dimensional Data

Anthony Christidis; Jeyshinee Pyneeandee; Gabriela Cohen-Freue

Abstract

The analysis of high-dimensional data, common in fields such as genomics, is complicated by the presence of cellwise contamination, where individual cells rather than entire rows are corrupted. This contamination poses a significant challenge to standard variable selection techniques. While recent ensemble methods have introduced deterministic frameworks that partition the predictor space to manage high collinearity, these architectures were not designed to handle cellwise contamination, leaving a critical methodological gap. To bridge this gap, we propose the Fast and Scalable Cellwise-Robust Ensemble (FSCRE) algorithm, a multi-stage framework integrating three key statistical stages. First, the algorithm establishes a robust foundation by deriving a cleaned data matrix and a reliable, cellwise-robust covariance structure. Variable selection then proceeds via a competitive ensemble: a robust, correlation-based formulation of the Least-Angle Regression (LARS) algorithm proposes candidates for multiple sub-models, and a cross-validation criterion arbitrates their final assignment. Despite its architectural complexity, the proposed method enjoys fundamental theoretical guarantees, including invariance properties and local selection stability. Through extensive simulations and a bioinformatics application, we demonstrate FSCRE's superior performance in variable selection precision, recall, and predictive accuracy across various contamination scenarios. This work provides a unified framework connecting cellwise-robust estimation with high-performance ensemble learning, with an implementation available on CRAN.
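To see why cellwise contamination is especially damaging in high dimensions, the following minimal sketch computes the expected share of rows that contain at least one corrupted cell. It assumes that every cell is contaminated independently with the same probability; the function name `contaminated_row_fraction` and the 5% rate are illustrative choices, not taken from the paper.

```python
def contaminated_row_fraction(eps: float, p: int) -> float:
    """Expected share of rows with at least one contaminated cell.

    Assumption (illustrative): each of the p cells in a row is
    contaminated independently with probability eps.
    """
    return 1.0 - (1.0 - eps) ** p


# With a fixed per-cell rate, the share of affected rows grows quickly with p.
for p in (10, 100, 1000):
    print(p, round(contaminated_row_fraction(0.05, p), 4))
```

Under this assumption, even a small per-cell rate leaves almost no fully clean rows once the number of predictors is large, which is why methods that downweight or discard entire rows lose most of their data.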

Paper Structure (27 sections, 6 theorems, 29 equations, 5 figures, 2 tables, 2 algorithms)

Introduction
Background and Literature Review
Regression under Cellwise Contamination
Cellwise-Robust Methodologies
Ensemble Methods and the Unaddressed Methodological Gap
The Fast and Scalable Cellwise-Robust Ensemble Algorithm
Robust Foundation
The Robust LARS Candidate Proposer
Predictive Arbitration and Final Model Fitting
Theoretical Properties and Complexity
Invariance and Equivariance Properties
Computational Complexity
Local Selection Stability
Simulation Study
Data Generation and Contamination Models
...and 12 more sections

Key Result

Proposition 1

The set of selected predictor indices returned by the FSCRE algorithm is invariant to per-column affine transformations of the observed data matrix $[\mathbf{y}, \mathbf{X}]$.
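The intuition behind this kind of invariance can be checked numerically on a much simpler selector. The sketch below is not the FSCRE algorithm: it ranks predictors by absolute Pearson correlation with the response, a hypothetical stand-in chosen only because it depends on the data through correlations. The helper `select_top_k`, the simulated data, and the restriction to nonzero scales are all assumptions made for illustration.

```python
import numpy as np


def select_top_k(X: np.ndarray, y: np.ndarray, k: int) -> set:
    """Indices of the k predictors with largest |Pearson correlation| with y.

    Illustrative stand-in for a correlation-based selector, not FSCRE.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    )
    return set(np.argsort(-np.abs(corr))[:k].tolist())


rng = np.random.default_rng(0)
n, p, k = 100, 20, 5
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=n)

# Per-column affine transformation: nonzero scale (either sign) plus a shift.
scales = rng.uniform(0.5, 3.0, size=p) * rng.choice([-1.0, 1.0], size=p)
shifts = rng.normal(size=p)
X_t = X * scales + shifts
y_t = -4.0 * y + 7.0

before = select_top_k(X, y, k)
after = select_top_k(X_t, y_t, k)
print(before == after)
```

Shifting a column leaves its correlation with the response unchanged, and rescaling it can only flip the sign, so a ranking based on absolute correlations returns the same index set before and after the transformation.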

Figures (5)

Figure 1: MSPE across 50 splits for the Mixture Correlation scenario. Performance is evaluated across three SNRs and three sparsity levels. Because the error is scaled by the noise variance, the optimal possible MSPE is 1.0.
Figure 2: Recall (top row) and precision (bottom row) across 50 splits for the Mixture Correlation scenario, comparing the DDC-EN pipeline with the proposed FSCRE algorithm.
Figure 3: Median computational execution time (in seconds) for the proposed FSCRE algorithm and the baseline DDC-EN pipeline, plotted on logarithmic axes. Left: Time as a function of the number of predictors $p$, with sample size fixed at $n=100$. Right: Time as a function of sample size $n$, with predictors fixed at $p=1{,}000$.
Figure 4: MSPE across 50 random splits for the prediction of ER-$\alpha$ protein abundance. The models were evaluated on both the original TCGA data (light grey) and data subjected to targeted artificial cellwise contamination in the training set (dark grey).
Figure 5: Median Mean Squared Prediction Error (MSPE), Recall, and Precision of the FSCRE algorithm as a function of the number of sub-models ($K$). Results are shown for the Mixture Correlation scenario (SNR $= 1.0$) across three sparsity levels (50, 100, and 200 active predictors) over 50 replications.

Theorems & Definitions (6)

Proposition 1: General Affine Invariance
Proposition 2: Permutation Equivariance
Proposition 3: Intercept Invariance
Proposition 4: Computational Complexity
Proposition 5: Local Selection Stability
Lemma 1
