ALPCAHUS: Subspace Clustering for Heteroscedastic Data

Javier Salazar Cavazos, Jeffrey A Fessler, Laura Balzano

TL;DR

ALPCAHUS addresses subspace clustering under heteroscedastic noise by jointly learning per-sample noise variances and subspace bases for each cluster. It extends LR-ALPCAH to the union-of-subspaces setting and incorporates an ensemble strategy to stabilize clustering via co-association affinities, with convergence guarantees and adaptive rank estimation. Empirical results on synthetic and real data (quasars and Indian Pines hyperspectral imagery) show that ALPCAHUS often outperforms conventional methods, approaching the performance of a noisy oracle, especially when data quality varies across samples. The method demonstrates practical robustness to heteroscedasticity and highlights future directions for manifold extensions and feature-space heteroscedastic models.
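The co-association idea mentioned above can be illustrated with a small sketch. It assumes the common ensemble convention in which the affinity between two samples is the fraction of base clusterings that place them in the same cluster; the function name `co_association` and the toy label runs are hypothetical, and any thresholding or spectral clustering step the paper applies afterwards is not reproduced here.

```python
import numpy as np

def co_association(label_runs):
    """Fraction of runs in which each pair of samples shares a cluster.

    label_runs: array-like of shape (B, N), one row of cluster labels per
    base clustering. Label values need not be aligned across runs, since
    only within-run equality is used.
    """
    runs = np.asarray(label_runs)
    B, N = runs.shape
    A = np.zeros((N, N))
    for labels in runs:
        # Boolean N x N matrix: True where samples i and j share a label.
        A += labels[:, None] == labels[None, :]
    return A / B

# Three hypothetical base clusterings of four samples.
runs = [[0, 0, 1, 1],
        [1, 1, 0, 0],   # same partition as the first run, labels permuted
        [0, 1, 1, 1]]
A = co_association(runs)
```

Because only within-run label equality is counted, permuting cluster labels between runs does not change the affinity, which is what makes such a matrix usable as input to a final clustering step.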

Abstract

Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. Various methods, such as K-Subspaces (KSS), have been proposed to extend PCA to the union of subspaces (UoS) setting for clustering data that come from multiple subspaces. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a heteroscedastic subspace clustering method, named ALPCAHUS, that estimates the sample-wise noise variances and uses this information to improve the estimate of the subspace bases associated with the low-rank structure of the data. This clustering algorithm builds on KSS principles by extending the recently proposed heteroscedastic PCA method, named LR-ALPCAH, to clusters with heteroscedastic noise in the UoS setting. Simulations and real-data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing clustering algorithms. Code is available at https://github.com/javiersc1/ALPCAHUS.
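To make the alternating structure described in the abstract concrete, here is a deliberately simplified sketch. It is not LR-ALPCAH or the authors' implementation: it assumes a plain K-Subspaces loop in which each cluster basis comes from a variance-weighted PCA (columns scaled by $1/\sqrt{\nu_i}$) and each $\nu_i$ is estimated from that sample's residual energy, floored at a small threshold `alpha`. The function name `weighted_kss`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def weighted_kss(Y, K, d, n_iter=20, alpha=1e-6, seed=0):
    """Simplified heteroscedastic K-Subspaces sketch (illustrative only).

    Y: (D, N) data matrix with samples as columns.
    K: number of subspaces, d: dimension of each subspace.
    alpha: floor on the estimated per-sample noise variances.
    """
    rng = np.random.default_rng(seed)
    D, N = Y.shape
    # Random orthonormal initial bases, one per cluster.
    U = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(K)]
    nu = np.ones(N)
    labels = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        # Residual energy of every sample with respect to every subspace.
        resid = np.stack(
            [np.sum((Y - Uk @ (Uk.T @ Y)) ** 2, axis=0) for Uk in U]
        )  # shape (K, N)
        # Cluster update: assign each sample to its closest subspace.
        labels = np.argmin(resid, axis=0)
        # Variance update: per-sample residual energy per coordinate.
        nu = np.maximum(resid[labels, np.arange(N)] / D, alpha)
        # Basis update: weighted PCA so that noisy samples count less.
        for k in range(K):
            idx = np.flatnonzero(labels == k)
            if idx.size < d:
                continue  # too few samples; keep the previous basis
            Yw = Y[:, idx] / np.sqrt(nu[idx])
            U[k] = np.linalg.svd(Yw, full_matrices=False)[0][:, :d]
    return labels, U, nu

# Toy data: two 2D subspaces in R^20, half the samples per cluster noisy.
rng = np.random.default_rng(1)
D, d, K, n = 20, 2, 2, 100
blocks = []
for k in range(K):
    basis = np.linalg.qr(rng.standard_normal((D, d)))[0]
    clean = basis @ rng.standard_normal((d, n))
    sigma = np.where(np.arange(n) < n // 2, 0.05, 0.5)  # per-sample std
    blocks.append(clean + sigma * rng.standard_normal((D, n)))
Y = np.hstack(blocks)
labels, U, nu = weighted_kss(Y, K=K, d=d)
```

Like any K-Subspaces variant, this loop depends on its initialization and can stop at a poor local solution, so the sketch only shows the order of the updates and makes no claim about clustering accuracy.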

Paper Structure

This paper contains 33 sections, 1 theorem, 36 equations, 9 figures, 1 table.

Key Result

Theorem 1

Consider the ALPCAHUS cost function $f(\mathcal{L}, \mathcal{R}, \bm{\Pi}, \mathcal{C})$ in (eq:alpcahus). Assume a noise variance threshold parameter $\alpha > 0$ that lower bounds all $\nu_i$, and the cluster assignment criteria in (eq:cluster_update) that accepts chan

Figures (9)

  • Figure 1: Two 1D subspaces, colored blue and yellow, with data consisting of two noise groups shown with circle (low noise) and triangle (high noise) markers.
  • Figure 2: Clustering error over the heteroscedastic landscape for various subspace clustering algorithms.
  • Figure 3: Percentage difference (%) obtained by subtracting the ALPCAHUS clustering error from the EKSS clustering error as the amount of good data varies.
  • Figure 4: Clustering error (%) for TIPS initialization scheme vs. random initialization for the ALPCAHUS method ($B=1$).
  • Figure 5: Adaptive rank estimation using eigengap heuristic and proposed FlipPA approach (true rank $d = 6$).
  • ...and 4 more figures
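
The eigengap heuristic named in the Figure 5 caption can be sketched as follows. This is the generic textbook version, assumed here for illustration: the rank is taken as the position of the largest drop between consecutive eigenvalues of the sample covariance. The function name `eigengap_rank` and the synthetic matrix are hypothetical, and the paper's FlipPA approach is not shown.

```python
import numpy as np

def eigengap_rank(Y):
    """Estimate rank as the index of the largest consecutive eigenvalue gap.

    Y: (D, N) data matrix with samples as columns.
    """
    s = np.linalg.svd(Y, compute_uv=False)   # singular values, descending
    eig = s ** 2 / Y.shape[1]                # sample covariance eigenvalues
    gaps = eig[:-1] - eig[1:]                # drop between neighbors
    return int(np.argmax(gaps)) + 1

# Synthetic matrix with prescribed singular values: six large, four small.
rng = np.random.default_rng(0)
svals = np.array([10, 9, 8, 7, 6, 5, 0.5, 0.4, 0.3, 0.2])
U = np.linalg.qr(rng.standard_normal((10, 10)))[0]
V = np.linalg.qr(rng.standard_normal((50, 10)))[0]
Y = U @ np.diag(svals) @ V.T
rank = eigengap_rank(Y)   # the largest gap follows the sixth eigenvalue
```

In this example the gap after the sixth eigenvalue dominates by construction, so the estimate is 6. When the gap is less clear-cut, for example when signal and noise eigenvalues overlap, the largest gap may not sit at the true rank.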

Theorems & Definitions (2)

  • Theorem 1
  • proof