Spectral Self-supervised Feature Selection

Daniel Segal, Ofir Lindenbaum, Ariel Jaffe

TL;DR

This paper addresses unsupervised feature selection in high-dimensional data by introducing a spectral self-supervised framework that uses robust pseudo-labels derived from graph-Laplacian eigenvectors. The core idea is to (a) generate discrete pseudo-labels from eigenvectors via binarization, (b) select a stable subset of eigenvectors using a model-variability criterion, and (c) score features by training surrogate models to predict these pseudo-labels, with a max-aggregation across the selected eigenvectors. The approach is supported by theory on eigenvector convergence in manifold settings and a product-manifold model, and it is demonstrated to be robust to outliers and complex substructures across eight real-world datasets, with notable effectiveness on biological data. The proposed SSFS framework also emphasizes interpretability and flexibility by allowing different surrogate models and by enabling stability-based validation of the selected features, which has practical impact for clustering and manifold-learning tasks in high-dimensional domains.

Abstract

Choosing a meaningful subset of features from high-dimensional observations in unsupervised settings can greatly enhance the accuracy of downstream analysis, such as clustering or dimensionality reduction, and provide valuable insights into the sources of heterogeneity in a given dataset. In this paper, we propose a self-supervised graph-based approach for unsupervised feature selection. Our method's core involves computing robust pseudo-labels by applying simple processing steps to the graph Laplacian's eigenvectors. The subset of eigenvectors used for computing pseudo-labels is chosen based on a model stability criterion. We then measure the importance of each feature by training a surrogate model to predict the pseudo-labels from the observations. Our approach is shown to be robust to challenging scenarios, such as the presence of outliers and complex substructures. We demonstrate the effectiveness of our method through experiments on real-world datasets, showing its robustness across multiple domains, particularly its effectiveness on biological datasets.
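
The pipeline summarized above (pseudo-labels from Laplacian eigenvectors, stability-based eigenvector selection, surrogate-based feature scoring with max-aggregation) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel with a median bandwidth, median thresholding as the binarization rule, bootstrap variance of ridge-regression predictions as the variability measure, and absolute correlation as the feature importance are all stand-ins chosen for brevity. The function names (`laplacian_eigenvectors`, `binarize`, `variability`, `ssfs_scores`) and the toy data are hypothetical.

```python
import numpy as np

def laplacian_eigenvectors(X, n_vecs):
    """Leading nontrivial eigenvectors of the unnormalized Laplacian L = D - W."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / np.median(d2))   # Gaussian kernel; median bandwidth is an assumption
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)       # eigenvalues are returned in ascending order
    return vecs[:, 1:n_vecs + 1]      # drop the constant eigenvector

def binarize(v):
    """Pseudo-labels in {-1, +1} by median thresholding (assumed binarization rule)."""
    return np.where(v > np.median(v), 1.0, -1.0)

def variability(Xs, y, rng, n_boot=20, ridge=1e-3):
    """Mean variance of ridge-regression predictions over bootstrap resamples."""
    n, p = Xs.shape
    preds = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        A = Xs[idx].T @ Xs[idx] + ridge * n * np.eye(p)
        w = np.linalg.solve(A, Xs[idx].T @ y[idx])
        preds[b] = Xs @ w
    return preds.var(axis=0).mean()

def ssfs_scores(X, n_vecs=4, n_select=2, seed=0):
    """Max-aggregated feature scores over the least variable eigenvectors."""
    rng = np.random.default_rng(seed)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardized copy for the surrogates
    V = laplacian_eigenvectors(X, n_vecs)
    Y = np.column_stack([binarize(V[:, k]) for k in range(n_vecs)])
    var = np.array([variability(Xs, Y[:, k], rng) for k in range(n_vecs)])
    selected = np.argsort(var)[:n_select]          # most stable pseudo-labels
    y_sel = Y[:, selected]
    y_std = (y_sel - y_sel.mean(axis=0)) / y_sel.std(axis=0)
    corr = np.abs(Xs.T @ y_std) / len(Xs)          # |Pearson correlation| per feature
    return corr.max(axis=1), selected              # max-aggregation across eigenvectors

# Toy data: two clusters separated along features 0 and 1, plus eight noise features.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 10))
X[:, :2] += np.where(labels[:, None] == 1, 4.0, -4.0)

scores, selected = ssfs_scores(X)
print("feature ranking (best first):", np.argsort(scores)[::-1])
print("selected eigenvector indices:", selected)
```

Swapping in a different surrogate (for example a tree ensemble) only requires changing `variability` and the scoring line, which mirrors the flexibility the summary attributes to the framework.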

Paper Structure

This paper contains 39 sections, 2 theorems, 23 equations, 11 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

For $n \to \infty$ and under assumptions (i)-(iii), with probability larger than $1-4K^2n^{-10} - (2K+6)n^{-9}$, the $k$-th eigenvector ${\bm{v}}_k$ of the unnormalized Laplacian satisfies a convergence bound (the displayed inequality is not preserved in this summary), where $\|{\bm{v}}_k\|=1$ and $|\alpha|=o(1)$.

Figures (11)

  • Figure 1: Illustration of SSFS. (a) A t-SNE scatter plot of noisy MNIST digits (3, 6, 8). (b) The six leading eigenvectors of the graph Laplacian of the data, with samples ordered by digit identity. (c) We apply the $k$-medoids algorithm to compute pseudo-labels ${\bm{y}}_i^*$, shown as colors overlaid on the eigenvectors. (d) We select the three eigenvectors whose pseudo-labels are the most "stable" with respect to several prediction models (see Section \ref{section:eigenvectors_proc_selection}). (e) For each data feature, we estimate its importance score for each of the selected eigenvectors (see Section \ref{section:feature_selection}). (f) We aggregate the feature scores across eigenvectors.
  • Figure 2: The first four Laplacian eigenvectors of two real datasets. Samples are sorted by class label and colored by the outcome of a one-dimensional $k$-medoids clustering of each eigenvector. The vertical bar indicates the separation between the classes. In Prostate-GE, ${\bm{v}}_4$ is the most informative about the class labels, and an outlier is visible in the upper left of the third and fourth eigenvectors. In TOX-171, ${\bm{v}}_3$ is more informative about the class labels than ${\bm{v}}_2$.
  • Figure 3: Panel \ref{fig:prod_manifolds_mnist38} shows a scatter plot of the noisy MNIST dataset, containing digits 3 and 8, where each image is located according to its coordinates in the third and fourth eigenvectors. Panel \ref{fig:convergence} shows the leading eigenvector of a graph computed over $n$ points on a 1D interval and the leading eigenfunction $\cos(\pi x)$.
  • Figure 4: Panel (a) illustrates three features of a simulated dataset. Each feature is equal to a different polynomial of the same random latent variable $\theta_1$. Each point in the $3$D scatter plot is located according to the values of the three features and colored by the value of $\theta_1$. Panel (b) shows the eigenvectors of the graph Laplacian matrix. Each point is located according to the value of $(\theta_1,\theta_2)$ and colored by the value of its corresponding element in the eight leading eigenvectors. The eigenvectors are indexed by the vector $b$, whose elements $b_i$ determine the eigenvector order in the submanifold $\mathcal{M}^{(i)}$.
  • Figure 5: Clustering accuracy vs. the number of selected features on eight real-world datasets.
  • ...and 6 more figures
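
The convergence illustrated in Figure 3 (an eigenvector of a graph over $n$ points on a 1D interval versus the eigenfunction $\cos(\pi x)$) can be checked numerically. The sketch below is only an illustration: the uniform sampling on $[0, 1]$, the sample size, the Gaussian kernel, and the bandwidth `eps=0.05` are assumptions made for the example, and `fiedler_vector_1d` is a hypothetical helper name. Because an eigenvector is defined only up to sign and scale, the comparison uses the absolute correlation.

```python
import numpy as np

def fiedler_vector_1d(x, eps):
    """First nontrivial eigenvector of L = D - W for a Gaussian kernel of width eps."""
    d2 = (x[:, None] - x[None, :]) ** 2
    W = np.exp(-d2 / (2.0 * eps**2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)   # ascending eigenvalues; column 0 is the constant vector
    return vecs[:, 1]

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=500))   # n points on the unit interval
v = fiedler_vector_1d(x, eps=0.05)

target = np.cos(np.pi * x)                     # leading nontrivial eigenfunction
corr = abs(np.corrcoef(v, target)[0, 1])       # sign-invariant agreement measure
print(f"|correlation| with cos(pi x): {corr:.3f}")
```

Increasing the sample size while shrinking the bandwidth slowly is the regime in which the eigenvector is expected to track the eigenfunction more closely.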

Theorems & Definitions (2)

  • Theorem 1: Theorem 5.4 of cheng2022eigen
  • Theorem 2