Table of Contents
Fetching ...

A scalable Bayesian functional factor model for high-dimensional longitudinal molecular data

Salima Jaoua, Daniel Temko, Hélène Ruffieux

Abstract

Large-scale longitudinal molecular profiling is now firmly established in biomedical research, prompted by the need to uncover coordinated biomarker trajectories reflecting the dynamics of underlying biological mechanisms and characterise patient heterogeneity in disease progression. While a range of statistical tools exist for either longitudinal modelling or high-dimensional analysis, there is no unified framework tailored to address these questions jointly. Motivated by a longitudinal COVID-19 study conducted in Cambridge hospitals, we propose a Bayesian functional factor model to address this gap. The framework combines latent factor modelling with functional principal component analysis to represent shared temporal programmes across subsets of variables while capturing individual variation through low-dimensional functional scores. We specify sparsity-inducing priors that yield interpretable factor structure and allow the effective number of factors to be inferred via overspecification. An annealed variational algorithm ensures efficient joint posterior inference at scale. The approach achieves accurate recovery of temporal structure in simulations with up to 20 000 variables. Application to the COVID-19 data reveals clinically meaningful heterogeneity in recovery dynamics through interpretable subject-level scores capturing coordinated inflammatory and immune-response pathway activity. The methodology is implemented in the R package bayesSYNC.

A scalable Bayesian functional factor model for high-dimensional longitudinal molecular data

Abstract

Large-scale longitudinal molecular profiling is now firmly established in biomedical research, prompted by the need to uncover coordinated biomarker trajectories reflecting the dynamics of underlying biological mechanisms and characterise patient heterogeneity in disease progression. While a range of statistical tools exist for either longitudinal modelling or high-dimensional analysis, there is no unified framework tailored to address these questions jointly. Motivated by a longitudinal COVID-19 study conducted in Cambridge hospitals, we propose a Bayesian functional factor model to address this gap. The framework combines latent factor modelling with functional principal component analysis to represent shared temporal programmes across subsets of variables while capturing individual variation through low-dimensional functional scores. We specify sparsity-inducing priors that yield interpretable factor structure and allow the effective number of factors to be inferred via overspecification. An annealed variational algorithm ensures efficient joint posterior inference at scale. The approach achieves accurate recovery of temporal structure in simulations with up to 20 000 variables. Application to the COVID-19 data reveals clinically meaningful heterogeneity in recovery dynamics through interpretable subject-level scores capturing coordinated inflammatory and immune-response pathway activity. The methodology is implemented in the R package bayesSYNC.
Paper Structure (24 sections, 76 equations, 14 figures, 1 table)

This paper contains 24 sections, 76 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Graphical representation of model (\ref{['eq_full_model']}). The shaded nodes are observed, the others are inferred.
  • Figure 2: Qualitative recovery in a large-scale simulated setting ($N=100$ subjects, $p=20\,000$ variables, $Q=3$ latent factors, $L=2$ FPCA components). Results are shown for a single simulated dataset. (a) Observed trajectories for a randomly selected subset of subjects and variables together with fitted trajectories and 95% pointwise prediction bands obtained from annealed variational inference. (b) Estimated loading matrix for the retained factors, displayed for the first 50 variables as an illustrative subset; rows correspond to variables and columns to latent factors. (c) Comparison of true and estimated eigenfunctions for each factor after post-processing and alignment. (d) Comparison of true and estimated subject-level FPCA scores.
  • Figure 3: Effect of annealing on estimation accuracy and factor selection in the large-scale simulation setting ($N=100$, $p=20\,000$, $Q=3$, $L=2$). Results are averaged over 25 independently generated datasets; error bars denote $\pm$ one standard error across replicates. (a) Vanilla variational Bayes (no annealing). From left to right: posterior factor inclusion probabilities (PPI$^{\text{factor}}_q$) under an overspecified model with $Q_{\max}=5$; variable inclusion performance for loadings (AUC based on variable-factor loading PPI$_{jq}$); mean function integrated squared error (ISE); and eigenfunction ISE for $\ell=1,2$, under $Q_{\mathrm{max}}=3$ (oracle) and $Q_{\mathrm{max}}=5$. Reported score correlations are means (standard errors) across replicates. (b) Annealed variational Bayes under the same specification, with identical performance metrics.
  • Figure 4: Optimisation behaviour with and without annealing, summarised over 25 simulation replicates. Left: Difference in final ELBO (annealed VB minus vanilla VB) under overspecified models with $Q_{\mathrm{max}}=3$ and $Q_{\mathrm{max}}=5$. Positive values indicate improved optima under annealing. Right: Number of iterations until convergence for vanilla VB and annealed VB under $Q_{\mathrm{max}}=3$ and $Q_{\mathrm{max}}=5$. Boxplots summarise the empirical distribution across replicates.
  • Figure 5: Comparison of our method (bayesSYNC) and the Gaussian process-based method with or without subject-specific latent factor trajectories (MEFISTO and MEFISTO-base). (a) Average ROC curves for recovery of loadings in the scenario with $N=30$ subjects, computed over 25 simulation replicates. The curves for MEFISTO and MEFISTO-base overlap. Error bars represent $\pm 1.96 \times$ standard error across replicates. (b) Runtime (minutes) as a function of sample size $N$, averaged over replicates on an Intel Xeon CPU, 2.60 GHz machine. Error bars denote $\pm$ one standard error, based on 25 replicates ($14$ replicates only for MEFISTO in the $N=40$-scenario). The runtimes displayed for bayesSYNC and MEFISTO-base overlap. All methods are applied with $Q_{\max}=5$; serial runtimes are reported for comparability.
  • ...and 9 more figures