Spectral Self-supervised Feature Selection
Daniel Segal, Ofir Lindenbaum, Ariel Jaffe
TL;DR
This paper addresses unsupervised feature selection in high-dimensional data by introducing a spectral self-supervised feature selection framework (SSFS) that uses robust pseudo-labels derived from graph-Laplacian eigenvectors. The core idea is to (a) generate discrete pseudo-labels from eigenvectors via binarization, (b) select a stable subset of eigenvectors using a model-variability criterion, and (c) score features by training surrogate models to predict these pseudo-labels, taking the maximum score across the selected eigenvectors. The approach is supported by theory on eigenvector convergence in manifold settings and by a product-manifold model. Experiments on eight real-world datasets show that it is robust to outliers and complex substructures, with notable effectiveness on biological data. SSFS also emphasizes interpretability and flexibility: different surrogate models can be plugged in, and the selected features can be validated through their stability, which makes it practical for clustering and manifold-learning tasks in high-dimensional domains.
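Steps (a) and (c) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Gaussian kernel with a median-heuristic bandwidth, the median threshold used for binarization, the ridge-regularized linear surrogate, and all function names (`laplacian_eigenvectors`, `binarize`, `surrogate_importance`, `ssfs_scores`) are assumptions chosen to keep the example self-contained. The eigenvector-selection step (b) is omitted here, so the leading nontrivial eigenvectors are used directly.

```python
import numpy as np

def laplacian_eigenvectors(X, n_eigvecs):
    """Leading nontrivial eigenvectors of the symmetric normalized graph Laplacian."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Assumption: Gaussian affinity with a median-heuristic bandwidth.
    W = np.exp(-d2 / np.median(d2))
    np.fill_diagonal(W, 0.0)
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - inv_sqrt_deg[:, None] * W * inv_sqrt_deg[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    return vecs[:, 1:n_eigvecs + 1]      # skip the trivial first eigenvector

def binarize(vecs):
    """Step (a): threshold each eigenvector at its median (assumed rule)."""
    return np.where(vecs > np.median(vecs, axis=0), 1.0, -1.0)

def surrogate_importance(X, y, ridge=1.0):
    """Step (c): |coefficients| of a ridge-regularized linear surrogate (assumed model)."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    A = Z.T @ Z + ridge * np.eye(Z.shape[1])
    return np.abs(np.linalg.solve(A, Z.T @ (y - y.mean())))

def ssfs_scores(X, n_eigvecs=2):
    """Max-aggregate per-eigenvector feature importances into one score per feature."""
    labels = binarize(laplacian_eigenvectors(X, n_eigvecs))
    per_vec = np.stack([surrogate_importance(X, labels[:, j])
                        for j in range(labels.shape[1])])
    return per_vec.max(axis=0)
```

Calling `ssfs_scores(X)` on an `(n_samples, n_features)` array returns one score per feature, and the highest-scoring features are the selected ones. Taking the maximum rather than the mean lets a feature rank highly if it explains even a single eigenvector well.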
Abstract
Choosing a meaningful subset of features from high-dimensional observations in unsupervised settings can greatly enhance the accuracy of downstream analysis, such as clustering or dimensionality reduction, and provide valuable insights into the sources of heterogeneity in a given dataset. In this paper, we propose a self-supervised graph-based approach for unsupervised feature selection. The core of our method is computing robust pseudo-labels by applying simple processing steps to the eigenvectors of the graph Laplacian. The subset of eigenvectors used for computing pseudo-labels is chosen based on a model stability criterion. We then measure the importance of each feature by training a surrogate model to predict the pseudo-labels from the observations. Our approach is shown to be robust to challenging scenarios, such as the presence of outliers and complex substructures. Experiments on real-world datasets demonstrate that the method is effective across multiple domains, and particularly on biological datasets.
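The model stability criterion for choosing eigenvectors can be sketched as follows. This is a hedged illustration rather than the paper's exact procedure: it assumes that stability is measured as the total variance of surrogate feature importances across random half-size subsamples, and that the surrogate is a ridge-regularized linear model. The names `linear_importance`, `instability`, and `select_stable` are hypothetical.

```python
import numpy as np

def linear_importance(X, y, ridge=1.0):
    """|coefficients| of a ridge-regularized linear surrogate (assumed model)."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    A = Z.T @ Z + ridge * np.eye(Z.shape[1])
    return np.abs(np.linalg.solve(A, Z.T @ (y - y.mean())))

def instability(X, y, n_resamples=20, seed=0):
    """Sum over features of the importance variance across random subsamples."""
    rng = np.random.default_rng(seed)    # same seed -> same subsamples for every label vector
    n = X.shape[0]
    runs = []
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        runs.append(linear_importance(X[idx], y[idx]))
    return float(np.var(np.stack(runs), axis=0).sum())

def select_stable(X, pseudo_labels, n_select):
    """Indices of the pseudo-label columns whose surrogate models vary the least."""
    scores = np.array([instability(X, pseudo_labels[:, j])
                       for j in range(pseudo_labels.shape[1])])
    return np.argsort(scores)[:n_select]
```

The intuition is that pseudo-labels reflecting genuine structure yield similar feature importances no matter which subsample the surrogate is trained on, whereas pseudo-labels driven by noise or outliers yield importances that fluctuate from one subsample to the next.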
