Subspace Recovery in Winsorized PCA: Insights into Accuracy and Robustness

Sangil Han, Kyoowon Kim, Sungkyu Jung

TL;DR

This work analyzes subspace recovery via Winsorized PCA (WPCA), deriving concentration bounds for the WPCA subspace under elliptical distributions and data contamination. It introduces a strong breakdown notion for subspace-valued statistics and provides lower bounds and perturbation results showing WPCA robustness exceeds traditional PCA, especially in high dimensions. The results demonstrate consistency at minimax-like rates in suitable regimes and reveal a trade-off in the winsorization radius: too small or too large $r$ can hurt accuracy, while moderate winsorization yields robustness with accurate subspace recovery. The findings establish WPCA as a robust, scalable tool for high-dimensional data with outliers and heavy tails, and suggest future work on spike models and practical radius tuning.

Abstract

In this paper, we explore the theoretical properties of subspace recovery using Winsorized Principal Component Analysis (WPCA), utilizing a common data transformation technique that caps extreme values to mitigate the impact of outliers. Despite the widespread use of winsorization in various tasks of multivariate analysis, its theoretical properties, particularly for subspace recovery, have received limited attention. We provide a detailed analysis of the accuracy of WPCA, showing that increasing the number of samples while decreasing the proportion of outliers guarantees the consistency of the sample subspaces from WPCA with respect to the true population subspace. Furthermore, we establish perturbation bounds that ensure the WPCA subspace obtained from contaminated data remains close to the subspace recovered from pure data. Additionally, we extend the classical notion of breakdown points to subspace-valued statistics and derive lower bounds for the breakdown points of WPCA. Our analysis demonstrates that WPCA exhibits strong robustness to outliers while maintaining consistency under mild assumptions. A toy example is provided to numerically illustrate the behavior of the upper bounds for perturbation bounds and breakdown points, emphasizing winsorization's utility in subspace recovery.
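The procedure described in the abstract (cap extreme observations, then run PCA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes winsorization means radially shrinking each centered observation onto a ball of radius `r`, uses the coordinate-wise median as the center, and takes the top eigenvectors of the sample covariance of the winsorized data. The function names `winsorize_radial` and `wpca_subspace`, and the toy data, are hypothetical.

```python
import numpy as np

def winsorize_radial(X, r, center=None):
    """Shrink each centered row of X onto the ball of radius r (assumed form of winsorization)."""
    X = np.asarray(X, dtype=float)
    if center is None:
        # Assumption: coordinate-wise median as a simple robust center.
        center = np.median(X, axis=0)
    Z = X - center
    norms = np.linalg.norm(Z, axis=1)
    # Factor is 1 inside the ball and r / ||z|| outside; the floor avoids division by zero.
    scale = np.minimum(1.0, r / np.maximum(norms, 1e-12))
    return Z * scale[:, None]

def wpca_subspace(X, d, r, center=None):
    """Return a p x d orthonormal basis spanning the top-d eigenvectors
    of the sample covariance of the winsorized data."""
    Zr = winsorize_radial(X, r, center)
    S = np.cov(Zr, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :d]        # reorder so the leading directions come first

# Toy data (illustrative only): one dominant direction plus 5% gross outliers.
rng = np.random.default_rng(0)
inliers = rng.standard_normal((500, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
outliers = np.tile(100.0 * np.eye(5)[1], (25, 1))
X = np.vstack([inliers, outliers])
V = wpca_subspace(X, d=1, r=3.0)
```

The radius `r` is the tuning parameter whose trade-off the TL;DR mentions: a very large `r` leaves the outliers uncapped, while a very small `r` projects nearly every point onto the sphere.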

Paper Structure

This paper contains 28 sections, 11 theorems, 67 equations, 4 figures.

Key Result

Theorem 1

Assume $\mathbf{x}_i|_{i \notin \mathcal{I}_{\epsilon}} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{F}_{\boldsymbol{\Sigma}}$ follow an elliptical distribution and $\lambda_d > \lambda_{d+1}$. Let $\lambda_j^{(r)}$ denote the $j$th largest eigenvalue of $\operatorname{Cov}(\mathbf{x}^{(r)})$, where $\mathbf{x}^{(r)}$ [...] Moreover, if [...] for all $k = 1, 2, \dots$ with some $\sigma > 0$, then [...]

Figures (4)

  • Figure 1: Empirical expectation $\widehat{E}[\sin \Theta_{\epsilon}^{(r)}]$ for different tail behavior and contamination levels. Panels (a) and (b) show the results when $\mathbf x_i$ follows a multivariate $t_{3}$-distribution, while (c) and (d) represent the case where $\mathbf x_i$ follows a multivariate Gaussian distribution. In each figure, $\epsilon$ denotes the proportion of contaminated data.
  • Figure 2: Empirical expectation $\widehat{E}[\sin \Theta_{\epsilon}^{(r)}]$ for different tail behaviors. Panels (a) and (c) show the results under non-spiked model with the $t_3$ and Gaussian distributions, respectively. Panels (b) and (d) represent the spiked model.
  • Figure 3: Estimated lower bounds for the breakdown points in \ref{eq:bp_win}.
  • Figure 4: The largest principal angle $\Theta(\mathcal{V}_1^{(r)}(\mathbf X_\epsilon), \mathcal{V}_1^{(r)}(\mathbf X_0))$ and the perturbation bound versus contamination level $\epsilon$. The solid line represents the observed largest principal angle, the dotted line represents the perturbation bound from Theorem \ref{thm:perturbation}, and the vertical dashed line indicates the lower bound of the (weak) breakdown point from Theorem \ref{thm:breakdown}.
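
The figures above measure subspace error through the largest principal angle $\Theta$ between two subspaces. A minimal sketch of one standard way to compute $\sin\Theta$ is given here, assuming each subspace is supplied as a matrix with linearly independent spanning columns. The function name `sin_largest_principal_angle` is hypothetical and not taken from the paper.

```python
import numpy as np

def sin_largest_principal_angle(U, V):
    """Sine of the largest principal angle between span(U) and span(V).

    U and V are p x d arrays with linearly independent columns.
    """
    # Orthonormalize each basis (reduced QR keeps p x d shape).
    Qu, _ = np.linalg.qr(np.asarray(U, dtype=float))
    Qv, _ = np.linalg.qr(np.asarray(V, dtype=float))
    # Cosines of the principal angles are the singular values of Qu^T Qv.
    cosines = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    # The largest angle corresponds to the smallest cosine; clip rounding error.
    c_min = np.clip(cosines.min(), 0.0, 1.0)
    return float(np.sqrt(1.0 - c_min**2))
```

Computing the sine as $\sqrt{1 - \cos^2}$ loses some precision for very small angles, which is acceptable for plotting curves like those in the figures but may matter for tight numerical comparisons.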

Theorems & Definitions (23)

  • Theorem 1
  • Corollary 2
  • Definition 1
  • Definition 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6 (Wainwright, 2019, High-Dimensional Statistics: A Non-Asymptotic Viewpoint)
  • Corollary 7
  • proof
  • ...and 13 more