Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection
Louis Ohl, Pierre-Alexandre Mattei, Charles Bouveyron, Mickaël Leclercq, Arnaud Droit, Frédéric Precioso
TL;DR
Sparse GEMINI tackles joint feature selection and discriminative clustering in high-dimensional data by coupling the geometry-aware GEMINI objective with sparsity-inducing penalties. It supports both linear (logistic) and neural (LassoNet) architectures, and trains end to end using proximal gradients and explicit GEMINI gradients. The method achieves competitive clustering performance, measured by the adjusted Rand index (ARI), while delivering improved variable selection, measured by the variable selection error rate (VSER) and the correct variable rate (CVR), on synthetic and real datasets, including MNIST variants and a large Prostate-BCR transcriptomics dataset. A public GemClus package provides exact gradient computations, supporting reproducibility and broader adoption of the approach.
Abstract
Feature selection in clustering is a hard task that requires simultaneously discovering relevant clusters and the variables relevant to those clusters. While feature selection algorithms are often model-based, relying on optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model that maximises a geometry-aware generalisation of the mutual information, called GEMINI, with a simple l1 penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and scales easily to high-dimensional data and large numbers of samples, while requiring only the design of a discriminative clustering model. We demonstrate the performance of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm that can select subsets of variables relevant to the clustering without using relevance criteria or prior hypotheses.
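To illustrate how a sparsity penalty can be combined with gradient-based training, the sketch below shows one proximal gradient step in NumPy. It is a minimal illustration under stated assumptions, not the authors' implementation or the GemClus API: the gradient of the clustering objective is treated as a given array `grad`, the weights `W` are assumed to have one row per input feature, and the row-wise (group) variant of the penalty is an assumption made here so that a whole feature is dropped at once. All function names are hypothetical.

```python
import numpy as np


def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1: element-wise soft-thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)


def group_soft_threshold(W, tau):
    """Proximal operator of tau * sum_j ||W[j]||_2, with one group per row.

    Assumption for this sketch: row j of W holds every weight attached to
    input feature j, so zeroing a row removes that feature from the model.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Shrink each row towards zero; rows with norm <= tau become exactly zero.
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale


def proximal_ascent_step(W, grad, lr, lam):
    """One step for maximising objective(W) - lam * penalty(W).

    `grad` stands in for the gradient of the clustering objective with
    respect to W; computing it is outside the scope of this sketch.
    """
    return group_soft_threshold(W + lr * grad, lr * lam)


def selected_features(W):
    """Indices of the features whose weight row is not entirely zero."""
    return np.flatnonzero(np.linalg.norm(W, axis=1) > 0)


# Toy usage: two features, two clusters, a hand-written gradient.
W0 = np.zeros((2, 2))
toy_grad = np.array([[3.0, 4.0], [0.1, 0.1]])
W1 = proximal_ascent_step(W0, toy_grad, lr=1.0, lam=1.0)
kept = selected_features(W1)
```

In this toy run the second feature receives only a small gradient, so the thresholding sets its whole row to zero and only the first feature remains selected. Increasing `lam` removes more features, which is how a penalty of this kind avoids exploring feature subsets combinatorially.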
