Bayesian Profile Regression using Variational Inference to Identify Clusters of Multiple Long-Term Conditions Conditioning on Mortality in Population-Scale Data

James Rafferty, Keith R Abrams, Munir Pirmohamed, Mark Davies, Rhiannon K Owen

TL;DR

These findings show that SVI can be a useful and accurate method for fitting Bayesian models, especially when the dataset size would make Monte Carlo methods prohibitively time-consuming or impossible.

Abstract

Multiple long-term conditions (MLTC) are increasingly observed in clinical practice globally. Clustering methods that group diseases into commonly co-occurring clusters have been of interest for understanding how MLTC group together and how they affect patient outcomes. However, such approaches require large, often population-scale datasets. Bayesian Profile Regression (BPR) is a statistical model that combines a Dirichlet Process Mixture model with a hierarchical regression model, in order to form clusters of items conditional on covariates and an outcome of interest. We developed a BPR model using full-rank Stochastic Variational Inference (SVI) for application in large-scale data. We assessed its performance using simulation studies comparing fits using the No-U-Turn Sampler (NUTS) and full-rank SVI. We then fit a BPR model to find clusters of MLTC in a population-scale dataset held in the Secure Anonymised Information Linkage (SAIL) databank. We found results from full-rank SVI compared well with results from NUTS in a simulation study, and the improved fitting performance allowed for fitting models in population-scale datasets. There were 1,296,463 individuals in our electronic health record (EHR) cohort. The clustering model was conditioned on age at cohort entry, socioeconomic deprivation and sex, with mortality as the outcome. We used the Elixhauser comorbidity index disease definitions, and found there were 33 disease clusters. We found that clusters featuring metastatic cancer and cardiovascular diseases, such as congestive heart failure, were most strongly associated with the probability of mortality. Our findings show that SVI can be a useful and accurate method for fitting Bayesian models, especially when the dataset size would make Monte Carlo methods prohibitively time-consuming or impossible.

Paper Structure (21 sections, 10 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Parameter bias and 95% coverage computed by simulation. Top left: Biases of the estimated gradients in the response component. Top right: Biases of estimated probabilities in the mixture component on the log-odds scale, plotted against the value of the $\phi$ parameter. Bottom left: Coverage estimates of gradient parameters in the response component. Bottom right: Coverage estimates in the mixture component, plotted against the value of the $\phi$ parameter.
  • Figure 2: Parameter 95% coverage estimates by simulation as a function of batch size in SVI fits. Left: Coverage estimates for the gradient parameters in the response component. Right: Coverage estimates for the $\phi$ parameters in the mixture model. Horizontal lines show coverage = 95%.
  • Figure 3: A heatmap showing point estimates for the log-odds of disease and the outcome for each cluster. The outcome is calculated for women and men at the mean age of 42.5 years and living in the third deprivation quintile. A plot with the colours on a linear scale is in Supplemental Section \ref{sec:lin_scale}.
  • Figure S.1: A heatmap showing point estimates for the log-odds of disease and the outcome for each cluster, stratified by age at cohort entry. Top left: $20 < \mathrm{age} \leq 40$ years. Top right: $40 < \mathrm{age} \leq 60$ years. Lower left: $60 < \mathrm{age} \leq 80$ years. Lower right: $80 < \mathrm{age} \leq 100$ years. The outcome is calculated at the mean age of each stratum for individuals living in the third deprivation quintile.
  • Figure S.2: A heatmap showing point estimates for the probability of disease and the outcome for each cluster.
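
The bias and 95% coverage summaries reported in Figures 1 and 2 follow a standard simulation-study recipe, which can be sketched as below. This is a minimal illustration only and is not the authors' code: the function name `bias_and_coverage`, the toy normal estimator, and all numeric settings are assumptions introduced here. Bias is the mean difference between the estimates and the known true value across simulated datasets, and coverage is the fraction of 95% intervals that contain the true value.

```python
import numpy as np


def bias_and_coverage(estimates, lower, upper, true_value):
    """Summarise simulation replicates for one parameter (hypothetical helper).

    estimates, lower, upper: one point estimate and one interval per replicate.
    true_value: the value used to generate the simulated data.
    """
    estimates = np.asarray(estimates, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # Bias: average signed error of the point estimates.
    bias = float(np.mean(estimates - true_value))

    # Coverage: share of replicates whose interval contains the truth.
    covered = (lower <= true_value) & (true_value <= upper)
    coverage = float(np.mean(covered))
    return bias, coverage


# Toy stand-in for repeated model fits (assumption): an unbiased estimator
# with a known standard error and normal-theory 95% intervals.
rng = np.random.default_rng(0)
true_beta = 0.5
standard_error = 0.1
n_replicates = 2000

est = true_beta + standard_error * rng.standard_normal(n_replicates)
lo = est - 1.96 * standard_error
hi = est + 1.96 * standard_error

bias, coverage = bias_and_coverage(est, lo, hi, true_beta)
print(f"bias = {bias:.4f}, coverage = {coverage:.3f}")
```

In a real comparison, `est`, `lo` and `hi` would come from posterior summaries of each fitted model (for example, posterior means and 95% credible intervals), computed separately for each inference method so that their coverage can be compared against the nominal 95% line.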