Optimal Bound for PCA with Outliers using Higher-Degree Voronoi Diagrams
Sajjad Hashemian, Mohammad Saeed Arvenaghi, Ebrahim Ardeshir-Larijani
TL;DR
This work addresses robust PCA in the presence of outliers by recasting the problem through higher-degree Voronoi diagrams, which partition the subspace search space so that the optimal $r$-dimensional subspace can be identified. The authors present an exact algorithm with worst-case time $n^{d+\mathcal{O}(1)} \cdot \text{poly}(n,d)$ and a randomized method with time $2^{\mathcal{O}(r(d-r))} \cdot \text{poly}(n,d)$ that leverages Grassmannian sampling and an $\alpha$-gap separation to guarantee high-probability recovery of the correct subspace. The approach provides a clearer, geometry-driven framework for outlier-robust PCA and offers practical scalability for high-dimensional data, along with theoretical optimality bounds under standard complexity assumptions. Potential extensions include improved sampling strategies, dual geometric constructions such as Delaunay triangulations, and online variants of PCA.
Abstract
In this paper, we introduce new algorithms for Principal Component Analysis (PCA) with outliers. Utilizing techniques from computational geometry, specifically higher-degree Voronoi diagrams, we navigate to the optimal subspace for PCA even in the presence of outliers. This approach achieves an optimal solution with a time complexity of $n^{d+\mathcal{O}(1)}\text{poly}(n,d)$. Additionally, we present a randomized algorithm with a complexity of $2^{\mathcal{O}(r(d-r))} \times \text{poly}(n, d)$. This algorithm samples subspaces characterized in terms of a Grassmannian manifold. With this sampling method, we ensure a high likelihood of capturing the optimal subspace, with success probability $(1 - \delta)^T$, where $\delta$ represents the probability that a sampled subspace does not contain the optimal solution, and $T$ is the number of subspaces sampled, proportional to $2^{r(d-r)}$. Our use of higher-degree Voronoi diagrams and Grassmannian-based sampling offers a clearer conceptual pathway and practical advantages, particularly in handling large datasets or higher-dimensional settings.
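To make the sampling idea concrete, the following is a minimal sketch of a sample-and-score heuristic for PCA with outliers. It is not the authors' algorithm: it omits the Voronoi-diagram machinery and the $\alpha$-gap analysis, and it assumes linear subspaces through the origin (no centering). The function names (`sample_subspace`, `trimmed_cost`, `sampled_robust_pca`) and the parameters `k` (number of outliers to discard) and `num_samples` are hypothetical, introduced only for illustration. Each subspace is drawn uniformly from the Grassmannian by orthonormalizing a Gaussian matrix, the `k` farthest points are treated as outliers, and ordinary PCA is refit on the remaining points.

```python
import numpy as np


def sample_subspace(d, r, rng):
    """Draw an r-dimensional subspace of R^d uniformly from the Grassmannian.

    The column span of a d x r standard Gaussian matrix is rotation-invariant,
    so orthonormalizing it with QR gives a uniformly distributed subspace.
    """
    gaussian = rng.standard_normal((d, r))
    basis, _ = np.linalg.qr(gaussian)  # d x r matrix with orthonormal columns
    return basis


def trimmed_cost(points, basis, k):
    """Sum of squared distances to the subspace after dropping the k farthest points."""
    residual = points - (points @ basis) @ basis.T
    sq_dist = np.sum(residual ** 2, axis=1)
    inliers = np.argsort(sq_dist)[: len(points) - k]
    return sq_dist[inliers].sum(), inliers


def pca_basis(points, r):
    """Top-r right singular vectors: the best-fit r-dim subspace through the origin."""
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    return vt[:r].T


def sampled_robust_pca(points, r, k, num_samples, seed=0):
    """Illustrative heuristic: sample subspaces, trim k outliers, refit, keep the best."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    best_cost, best_basis, best_inliers = np.inf, None, None
    for _ in range(num_samples):
        candidate = sample_subspace(d, r, rng)
        # The sampled subspace induces a guess for the inlier set.
        _, inliers = trimmed_cost(points, candidate, k)
        # Refit ordinary PCA on that guess, then re-score with trimming.
        refined = pca_basis(points[inliers], r)
        cost, inliers = trimmed_cost(points, refined, k)
        if cost < best_cost:
            best_cost, best_basis, best_inliers = cost, refined, inliers
    return best_basis, best_inliers, best_cost


if __name__ == "__main__":
    # 40 inliers on the first coordinate axis of R^3, plus 3 far-away outliers.
    t = np.linspace(-5.0, 5.0, 40)
    inlier_points = np.outer(t, np.array([1.0, 0.0, 0.0]))
    outlier_points = np.array([[0.0, 50.0, 0.0], [0.0, 0.0, 60.0], [0.0, -40.0, 40.0]])
    data = np.vstack([inlier_points, outlier_points])
    basis, inliers, cost = sampled_robust_pca(data, r=1, k=3, num_samples=200)
    print("trimmed cost:", cost)
    print("recovered direction:", basis[:, 0])
```

In this sketch the number of samples is a free parameter. The abstract's analysis instead ties the number of sampled subspaces $T$ to $2^{r(d-r)}$, which reflects the dimension $r(d-r)$ of the Grassmannian being sampled.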
