On the Robustness of the Successive Projection Algorithm
Giovanni Barbarino, Nicolas Gillis
TL;DR
This work analyzes the robustness to noise of the successive projection algorithm (SPA) for separable simplex-structured matrix factorization (SSMF), formalizing how the conditioning of the vertex matrix $W$ governs the recovery error. It derives tighter bounds for the first SPA step, extends the improved guarantees to the rank-2 case and to certain cases of the translated variant (T-SPA), and proves that the existing bounds for SPA, SPA$^2$, and MVIE-based preconditioning are tight. It also proposes a new translation-and-lifting variant (TL-SPA) that reduces the conditioning of the problem and improves practical robustness, with gains demonstrated on synthetic datasets, including adversarial middle-point noise and rank-deficient scenarios. Overall, the results provide both theoretical guarantees and practical guidance for choosing SPA variants and preprocessing steps to recover latent simplex vertices from noisy data.
Abstract
The successive projection algorithm (SPA) is a workhorse algorithm for learning the $r$ vertices of the convex hull of a set of $(r-1)$-dimensional data points, a.k.a. a latent simplex, which has numerous applications in data science. In this paper, we revisit the robustness to noise of SPA and several of its variants. In particular, when $r \geq 3$, we prove the tightness of the existing error bounds for SPA and for two more robust preconditioned variants of SPA. We also provide significantly improved error bounds for SPA, by a factor proportional to the conditioning of the $r$ vertices, in two special cases: for the first extracted vertex, and when $r \leq 2$. We then provide further improvements for the error bounds of a translated version of SPA proposed by Arora et al. ("A practical algorithm for topic modeling with provable guarantees", ICML, 2013) in two special cases: for the first two extracted vertices, and when $r \leq 3$. Finally, we propose a new, more robust variant of SPA that first shifts and lifts the data points in order to minimize the conditioning of the problem. We illustrate our results on synthetic data.
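
For readers unfamiliar with SPA, the basic iteration can be sketched as follows. This is a minimal illustration of the standard algorithm (repeatedly pick the column with the largest norm, then project all columns onto the orthogonal complement of the picked one), not the authors' implementation, and it covers none of the translated, preconditioned, or lifted variants discussed in the paper. The names `spa`, `X`, `W`, and `H`, and the small noiseless example, are assumptions made for illustration only.

```python
import numpy as np

def spa(X, r):
    """Basic SPA sketch: return the indices of r columns of X.

    X is an m-by-n array whose columns are the data points.
    """
    R = np.array(X, dtype=float)           # residual matrix, updated in place of X
    selected = []
    for _ in range(r):
        norms = np.sum(R ** 2, axis=0)     # squared norm of each residual column
        j = int(np.argmax(norms))          # column with the largest residual norm
        if norms[j] == 0.0:                # nothing left to extract
            break
        selected.append(j)
        u = R[:, j] / np.sqrt(norms[j])    # unit vector along the selected column
        R = R - np.outer(u, u @ R)         # project onto the orthogonal complement of u
    return selected

# Hypothetical noiseless example: 3 vertices plus 2 convex combinations of them.
W = np.diag([1.0, 2.0, 3.0])               # vertices are the columns of W
H = np.array([[1.0, 0.0, 0.0, 0.5, 1 / 3],
              [0.0, 1.0, 0.0, 0.5, 1 / 3],
              [0.0, 0.0, 1.0, 0.0, 1 / 3]])  # each column sums to one
X = W @ H
idx = spa(X, 3)
print(sorted(idx))                          # the vertex columns are 0, 1, 2
```

In this noiseless setting the selected indices are exactly the columns of `X` that coincide with the vertices; the bounds studied in the paper quantify how far the extracted columns can drift from the true vertices once noise is added.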
