Bayesian Joint Additive Factor Models for Multiview Learning

Niccolo Anceschi, Federico Ferrari, David B. Dunson, Himel Mallick

TL;DR

The paper addresses the prediction of clinical outcomes from multiview data by learning a structured latent representation that splits variation into shared and view-specific factors. It introduces Joint Additive Factor Regression (JAFAR) with a novel dependent cumulative shrinkage process (D-CUSP) prior to ensure identifiability, uses a partially collapsed Gibbs sampler for efficient posterior inference, and adds copula-based extensions to handle non-Gaussian data. In simulations and in an application predicting time-to-labor onset from high-dimensional multi-omics data, JAFAR shows improved prediction, more reliable recovery of cross-view dependence, and interpretable latent factors compared with state-of-the-art alternatives. The method provides a practical framework for precision medicine and other multimodal settings, with an open-source R implementation.

Abstract

It is increasingly common in a wide variety of applied settings to collect data of multiple different types on the same set of samples. Our particular focus in this article is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. We propose a joint additive factor regression model (JAFAR) with a structured additive design, accounting for shared and view-specific components. We ensure identifiability via a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide an efficient implementation via a partially collapsed Gibbs sampler and extend our approach to allow flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (R package) is available at https://github.com/niccoloanceschi/jafar.
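To make the "shared and view-specific components" idea concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a Gaussian additive factor structure in which each view is the sum of a shared-factor term, a view-specific-factor term, and noise, and it assumes for illustration that the response loads on the shared factors only. The function name `simulate_multiview` and all of its parameters are hypothetical, and no priors or inference are involved.

```python
import random


def simulate_multiview(n, view_dims, k_shared, k_specific, noise_sd=0.1, seed=0):
    """Simulate n samples from an illustrative additive factor structure.

    Assumed (hypothetical) generative sketch, for each sample i and view m:
        x_mi = Lambda_m @ eta_i + Gamma_m @ phi_mi + noise
        y_i  = theta . eta_i + noise
    where eta_i are shared factors and phi_mi are view-specific factors.
    """
    rng = random.Random(seed)

    def gauss(sd=1.0):
        return rng.gauss(0.0, sd)

    # Shared loadings (Lambda_m) and view-specific loadings (Gamma_m), one per view.
    shared_loadings = [
        [[gauss() for _ in range(k_shared)] for _ in range(p)] for p in view_dims
    ]
    specific_loadings = [
        [[gauss() for _ in range(k_specific)] for _ in range(p)] for p in view_dims
    ]
    # Response coefficients on the shared factors (illustrative assumption).
    theta = [gauss() for _ in range(k_shared)]

    views = [[] for _ in view_dims]
    y = []
    for _ in range(n):
        eta = [gauss() for _ in range(k_shared)]  # shared across all views
        for m, p in enumerate(view_dims):
            phi = [gauss() for _ in range(k_specific)]  # specific to view m
            row = [
                sum(l * e for l, e in zip(shared_loadings[m][j], eta))
                + sum(g * f for g, f in zip(specific_loadings[m][j], phi))
                + gauss(noise_sd)
                for j in range(p)
            ]
            views[m].append(row)
        y.append(sum(t * e for t, e in zip(theta, eta)) + gauss(noise_sd))
    return views, y


# Example: 5 samples, two views with 3 and 4 features.
views, y = simulate_multiview(n=5, view_dims=[3, 4], k_shared=2, k_specific=1)
```

In this sketch, cross-view dependence arises only through the shared factors, while the view-specific factors contribute to within-view dependence alone.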

Paper Structure (28 sections, 36 equations, 11 figures, 1 table, 2 algorithms)

Figures (11)

  • Figure 1: Mean squared error (in logarithmic scale) of the predicted responses in the test sets of simulated data. The x-axis reports increasing size of the train set. The interior points and band edges correspond to the quartiles over 10 independent replicates for fixed dimensions.
  • Figure 2: Inferred coefficients in the simulated data. The two columns report the mean absolute deviations and the empirical coverage of the $95\%$ credible intervals. The interior points and band edges correspond to the quartiles over 10 independent replicates for fixed dimensions. On the right, the horizontal blue line corresponds to the correct coverage.
  • Figure 3: Frobenius norms of the differences between the true and inferred correlations for the simulated data. The two rows report the inter- and intra-view correlations, respectively. All norms have been rescaled by the dimensions of the corresponding matrices. The interior points and band edges correspond to the quartiles over 10 independent replicates for fixed dimensions.
  • Figure 4: Response prediction accuracy for the different methods considered. In the plots to the left, the dots and the vertical bars represent the expected values and the $95\%$ credible intervals for the predicted responses, respectively. The black horizontal ticks correspond to the true values of the response for each observation. Both CoopLearn and IntegLearn achieve good predictive performance on the train set, but do not perform as well on out-of-sample observations. bsfp performs worse in terms of both expected values and predictive intervals, which are almost as wide as the effective range of the response. jafar achieves remarkable generalization error on the test set.
  • Figure 5: Inferred activity patterns in the view-specific and shared loading matrices $\{ {\mathbf \Gamma} _m\}_{m=1}^M$ and $\{ {\mathbf \Lambda} _m\}_{m=1}^M$ in the two additive factor models considered. Here the $M=3$ omics layers correspond to immunome, metabolome, and proteome data, respectively. For bsfp, the reported values correspond to the fixed ranks inferred via the unifac initialization. For jafar, they are posterior means of the number of active columns, according to the latent indicators in the d-cusp construction. jafar further allows for composite activity patterns in the shared component of the model, as disentangled in the Venn diagram in the bottom part of the figure.
  • ...and 6 more figures