Cellwise and Casewise Robust Covariance in High Dimensions
Fabio Centofanti, Mia Hubert, Peter J. Rousseeuw
TL;DR
This work tackles robust covariance estimation in high-dimensional data contaminated by both cellwise and casewise outliers, as well as missing values. It develops cellRCov, a covariance estimator built on the decomposition $\boldsymbol\Sigma = \boldsymbol\Sigma_{X^k} + \boldsymbol\Sigma_{X^{\perp}}$, where a robust PCA-based step yields the principal subspace, and a robustly imputed residual covariance is combined with it using ridge-type regularization for stability. The authors establish consistency and asymptotic normality, derive both casewise and cellwise influence functions, and report superior performance in simulations with contaminated and missing data, as well as in real-data tasks such as anomaly detection and robust canonical correlation analysis (cellRCCA). The method comes with data-driven procedures to select the rank $k$ and the regularization parameter $\delta$, making it a practical tool for robust multivariate analysis in high dimensions.
Abstract
The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Recently, some robust covariance estimators have been developed that can handle both types of outliers, but their computation is feasible only in at most 20 dimensions. To remedy this, we propose the cellRCov method, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data. It relies on a decomposition of the covariance on principal and orthogonal subspaces, leveraging recent work on robust PCA. It also employs a ridge-type regularization to stabilize the estimated covariance matrix. We establish some theoretical properties of cellRCov, including its casewise and cellwise influence functions as well as consistency and asymptotic normality. A simulation study demonstrates the superior performance of cellRCov in contaminated and missing data scenarios. Furthermore, its practical utility is illustrated in a real-world application to anomaly detection. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.
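As a rough illustration of the decomposition on principal and orthogonal subspaces, the sketch below splits a covariance matrix into the two parts and checks that they add up to the whole. It is not cellRCov: it uses classical (non-robust) PCA on clean, complete data, and the function names `decompose_covariance` and `ridge`, as well as the simple form $\boldsymbol\Sigma + \delta \boldsymbol{I}$ for the regularization, are assumptions made for illustration only. The paper's robust PCA step, imputation, and exact regularization may differ.

```python
import numpy as np

def decompose_covariance(X, k):
    """Split the sample covariance into a rank-k principal part and a residual part.

    Illustrative only: uses classical PCA, not the robust PCA of cellRCov.
    """
    Xc = X - X.mean(axis=0)                  # center the data
    S = np.cov(Xc, rowvar=False)             # full sample covariance
    _, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors
    P = V @ V.T                              # projector onto the principal subspace
    X_k = Xc @ P                             # part of the data inside the subspace
    X_perp = Xc - X_k                        # orthogonal residuals
    S_k = np.cov(X_k, rowvar=False)          # covariance on the principal subspace
    S_perp = np.cov(X_perp, rowvar=False)    # covariance on the orthogonal subspace
    return S, S_k, S_perp

def ridge(S, delta):
    """Generic ridge-type regularization (assumed form, not the paper's)."""
    return S + delta * np.eye(S.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
S, S_k, S_perp = decompose_covariance(X, k=3)

# The cross terms vanish because the subspace is spanned by eigenvectors of S.
print(np.allclose(S, S_k + S_perp))

# The residual part alone is singular; a ridge term makes it invertible.
print(np.linalg.eigvalsh(S_perp).min(), np.linalg.eigvalsh(ridge(S_perp, 0.1)).min())
```

With classical PCA the two parts sum exactly to the sample covariance. A robust method would instead estimate each part so that outlying cells and cases have limited influence on the result.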
