Clustering Survival Data using a Mixture of Non-parametric Experts

Gabriel Buginga; Edmundo de Souza e Silva

TL;DR

SurvMixClust addresses the need to jointly cluster individuals and predict time-to-event outcomes in right-censored data. It models the data as a finite mixture of $K$ nonparametric survival distributions. The covariate-driven mixing weights $\tau_k(\mathbf{x})$ are learned via multinomial logistic regression, and the cluster-specific survival functions are estimated with the Kaplan-Meier estimator. An EM-based training procedure computes responsibilities and updates both the clustering and survival components, using a stochastic variant to improve scalability. Across five public datasets, the approach yields balanced, heterogeneous clusters with distinct survival curves. It is competitive in predictive performance with non-clustering survival models and outperforms clustering baselines in several settings, highlighting its potential for precision medicine and heterogeneous treatment effect analysis. The accompanying code enables practical adoption and further methodological development.
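The prediction side of such a mixture can be sketched in a few lines. The block below is a minimal illustration and not the authors' implementation: it assumes the individual survival function is the gated sum $S(t \mid \mathbf{x}) = \sum_k \tau_k(\mathbf{x})\, S_k(t)$, with $\tau_k$ given by a softmax over linear scores and each $S_k$ a plain Kaplan-Meier curve fitted on hard cluster members. The function names, the gating parameters `W` and `b`, and the toy data are all hypothetical, and the EM training loop is not shown.

```python
import numpy as np

def kaplan_meier(times, events, grid):
    """Kaplan-Meier estimate of S(t) = P(T > t), evaluated at each point of `grid`."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])  # sorted distinct event times
    steps = [1.0]  # survival level before the first observed event
    for u in event_times:
        at_risk = np.sum(times >= u)                  # still under observation at u
        died = np.sum((times == u) & (events == 1))   # events exactly at u
        steps.append(steps[-1] * (1.0 - died / at_risk))
    # Number of event times <= t selects the step that is active at t.
    idx = np.searchsorted(event_times, grid, side="right")
    return np.asarray(steps)[idx]

def gating_weights(X, W, b):
    """Multinomial-logistic (softmax) mixing weights tau_k(x), one row per individual."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def mixture_survival(tau, cluster_curves):
    """S(t | x) = sum_k tau_k(x) * S_k(t), evaluated on the shared time grid."""
    return tau @ cluster_curves

# Toy data (hypothetical): two clusters with hard membership, one shared grid.
grid = np.array([0.0, 1.0, 2.0, 3.0, 5.0])
km_a = kaplan_meier([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1], grid)
km_b = kaplan_meier([2.0, 5.0, 6.0], [1, 1, 0], grid)
cluster_curves = np.vstack([km_a, km_b])  # shape (K, len(grid))

X = np.array([[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]])
W = np.zeros((2, 2))  # untrained gate: every individual gets uniform weights
b = np.zeros(2)
tau = gating_weights(X, W, b)
S_mix = mixture_survival(tau, cluster_curves)
```

With the untrained gate above, every individual receives the average of the two cluster curves; a trained gate would shift each individual's curve toward the clusters its covariates favor.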

Abstract

Survival analysis aims to predict the timing of future events across various fields, from medical outcomes to customer churn. However, the integration of clustering into survival analysis, particularly for precision medicine, remains underexplored. This study introduces SurvMixClust, a novel algorithm for survival analysis that integrates clustering with survival function prediction within a unified framework. SurvMixClust learns latent representations for clustering while also predicting individual survival functions using a mixture of non-parametric experts. Our evaluations on five public datasets show that SurvMixClust creates balanced clusters with distinct survival curves, outperforms clustering baselines, and competes with non-clustering survival models in predictive accuracy, as measured by the time-dependent c-index and log-rank metrics.

Paper Structure (13 sections, 13 equations, 5 figures, 2 tables)

Introduction
The Model
Definition
Training with Expectation-Maximization
Training Algorithm
Experiments
Discussion
Related Works
Conclusion
Experimental Setup
Visualizing the survival functions
Hyperparameters
Mathematical Derivations

Figures (5)

Figure 1: Graphical model representing the independence assumptions of the main model. Note that the features $X$ can influence $T^{*}$ only through the cluster assignment $Z$.
Figure 2: Clustering of the SUPPORT test set produced by SCA, K-means, and our proposal (SurvMixClust). The top row shows the Kaplan-Meier curve of each cluster's subpopulation with its confidence intervals. The bottom row shows the same survival functions, but without the confidence intervals and the number of data points inside each cluster.
Figure 3: Time-dependent C-index across datasets and models. Each boxplot displays 20 samples.
Figure 4: Log-rank score across datasets and models. Each boxplot displays 20 samples.
Figure 5: Clusterings inferred by SurvMixClust. Each card corresponds to a dataset. For each number of clusters, a randomly selected trained model is used to cluster the test set. Each card shows the Kaplan-Meier survival functions of the resulting subpopulations.
