Clustering-Based Outcome Models for Clinical Studies: A Scoping Review

Johannes Vilsmeier, Fabian Eibensteiner, Franz König, Francois Mercier, Robin Ristl, Nigel Stallard, Marc Vandemeulebroecke, Sarah Zohar, Martin Posch

TL;DR

This review provides a systematic overview of methods that combine covariate-based clustering of observational units (patients) with outcome models for clinical studies and discusses applications to rare disease research, covariate adjustment and borrowing from historical data, and subgroup-specific treatment effect estimation in clinical trials.

Abstract

This review provides a systematic overview of methods that combine covariate-based clustering of observational units (patients) with outcome models for clinical studies. We distinguish between informed-cluster models, where the outcome contributes to cluster formation, and agnostic-cluster models, where clustering is performed solely on covariates in a separate first step. Informed-cluster models include product partition models with covariates (PPMx), finite mixtures of regression models (FMR), and cluster-aware supervised learning (CluSL). Agnostic-cluster models encompass two-step procedures using either model-based or algorithmic clustering followed by cluster-specific regression models. Following a systematic search of Web of Science and PubMed, 55 records were identified that propose or evaluate such models. We describe the key models, summarise study characteristics, and present applications from biomedical and public health research. Clustering-based outcome models are particularly relevant for settings with high-dimensional covariates (e.g., biomarker panels and "omics") and heterogeneous patient populations. These models can support risk stratification and we discuss extensions to estimate subgroup-specific treatment effects. They are most valuable when the population is clustered in distinct regions of the covariate space that correspond to different outcome distributions. We discuss applications to rare disease research, covariate adjustment and borrowing from historical data, and subgroup-specific treatment effect estimation in clinical trials.
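
The two-step agnostic-cluster procedure described above can be illustrated with a minimal sketch. This is not taken from any of the reviewed records: it assumes a single covariate, uses one-dimensional k-means as a stand-in for algorithmic clustering, and fits an ordinary least-squares line within each cluster. The function names (`kmeans_1d`, `ols`, `fit_two_step`, `predict`) and the simulated data are hypothetical and serve only to make the idea concrete.

```python
import random
from statistics import mean


def kmeans_1d(xs, k=2, iters=50):
    """Lloyd's algorithm on one covariate (assumes k >= 2). Returns labels, centres."""
    lo, hi = min(xs), max(xs)
    # Deterministic start: centres spread evenly over the observed range.
    centres = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to its nearest centre (covariates only, no outcome).
        labels = [min(range(k), key=lambda j: abs(x - centres[j])) for x in xs]
        new = []
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            new.append(mean(members) if members else centres[j])
        if new == centres:
            break
        centres = new
    return labels, centres


def ols(xs, ys):
    """Least-squares fit of y = a + b * x. Returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def fit_two_step(xs, ys, k=2):
    """Step 1: cluster on the covariate only. Step 2: one regression per cluster."""
    labels, centres = kmeans_1d(xs, k)
    models = {}
    for j in range(k):
        cx = [x for x, lab in zip(xs, labels) if lab == j]
        cy = [y for y, lab in zip(ys, labels) if lab == j]
        models[j] = ols(cx, cy)
    return centres, models


def predict(x, centres, models):
    """Assign x to the nearest cluster centre, then apply that cluster's model."""
    j = min(range(len(centres)), key=lambda c: abs(x - centres[c]))
    a, b = models[j]
    return a + b * x


# Simulated data (an assumption for illustration): two well-separated covariate
# regions, each with its own outcome relationship.
rng = random.Random(0)
xs, ys = [], []
for _ in range(100):
    x = rng.gauss(0.0, 1.0)
    xs.append(x)
    ys.append(1.0 + 2.0 * x + rng.gauss(0.0, 0.1))
for _ in range(100):
    x = rng.gauss(10.0, 1.0)
    xs.append(x)
    ys.append(5.0 - 1.0 * x + rng.gauss(0.0, 0.1))

centres, models = fit_two_step(xs, ys)
```

Because clustering here never sees the outcome, the sketch only recovers the cluster-specific slopes when the covariate regions happen to align with different outcome distributions, which is the setting the abstract identifies as most favourable. An informed-cluster model would instead let the outcome influence the partition.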

Paper Structure (23 sections, 25 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: PRISMA flow chart showing the literature search and screening process. The five records identified through other sources were either cited in included studies or known to the reviewers from prior work.
  • Figure 2: Categorisation of clustering-based outcome models identified in the review of 55 records. Models are first categorised by whether the outcome variable contributes to cluster formation (informed-cluster models) or whether clustering is performed solely on covariates (agnostic-cluster models). Agnostic-cluster models are further subdivided into algorithmic and model-based clustering approaches. Numbers in parentheses indicate counts in each category; totals exceed 55 because some records reported multiple methods. PPM/PPMx = Product partition models/product partition models with covariates; FMR = Finite mixtures of regression models; CluSL = Cluster-aware supervised learning. The category "Other" comprises cosine-similarity clustering [gaoFeatureReductionText2012a], fuzzy $C$-means clustering and subtractive clustering [poojamrHybridDecisionSupport2015], and a three-step clustering approach for longitudinal data [nguyenMultivariateLongitudinalData2023].
  • Figure 3: Number of records by year range (period) and discipline of publication sources (49 journal articles, 5 conference papers, and 1 preprint).
  • Figure 4: Scatter plot of sample sizes ($n$) against number of covariates ($d$) across records using real data (48 records, left) and simulated data (23 records, right). Each record contributes up to two data points: one for the minimum number of covariates paired with the minimum sample size (light grey), and one for the maximum number of covariates paired with the maximum sample size (dark grey). The solid grey lines connect data points from the same study. The dashed line represents $n = d$. Both axes use $\log_{10}$ scales.