Bayesian Joint Additive Factor Models for Multiview Learning
Niccolo Anceschi, Federico Ferrari, David B. Dunson, Himel Mallick
TL;DR
The paper addresses predicting clinical outcomes from multiview data by learning a structured latent representation that splits variation into shared and view-specific factors. It introduces Joint Additive Factor Regression (JAFAR) with a novel dependent cumulative shrinkage process (D-CUSP) prior to ensure identifiability, uses a partially collapsed Gibbs sampler for efficient posterior inference, and adds copula-based extensions to handle non-Gaussian data. In simulations and in a labor-onset prediction application on high-dimensional multi-omics data, JAFAR shows improved prediction, more reliable recovery of cross-view dependence, and interpretable latent factors compared with state-of-the-art alternatives. The method provides a practical framework for precision medicine and other multimodal settings, with an open-source R implementation.
Abstract
It is increasingly common in a wide variety of applied settings to collect data of multiple different types on the same set of samples. Our particular focus in this article is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. We propose a joint additive factor regression model (JAFAR) with a structured additive design, accounting for shared and view-specific components. We ensure identifiability via a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide an efficient implementation via a partially collapsed Gibbs sampler and extend our approach to allow flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (R package) is available at https://github.com/niccoloanceschi/jafar.
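To make the "shared and view-specific components" concrete, the following is a minimal simulation sketch of a generic additive factor structure. It is an illustration under stated assumptions, not the authors' specification or the API of the jafar R package: each view is assumed to be generated as shared loadings times shared factors, plus view-specific loadings times view-specific factors, plus Gaussian noise, and the response is assumed to depend linearly on the shared factors only. The function name `simulate_additive_factor_views` and all of its parameters are hypothetical.

```python
import numpy as np


def simulate_additive_factor_views(n, dims, k_shared, k_specific,
                                   noise_sd=0.5, seed=0):
    """Simulate multiview data from a generic additive factor structure.

    Assumed (hypothetical) generative form, for illustration only:
        x_m = eta @ Lambda_m.T + phi_m @ Gamma_m.T + noise   (view m)
        y   = eta @ theta + noise                            (response)
    where eta holds shared factors and phi_m holds view-specific factors.
    """
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal((n, k_shared))       # shared latent factors
    views, shared_loadings = [], []
    for p_m, k_m in zip(dims, k_specific):
        lam = rng.standard_normal((p_m, k_shared))  # shared loadings Lambda_m
        gam = rng.standard_normal((p_m, k_m))       # view-specific loadings Gamma_m
        phi = rng.standard_normal((n, k_m))         # view-specific factors phi_m
        noise = noise_sd * rng.standard_normal((n, p_m))
        views.append(eta @ lam.T + phi @ gam.T + noise)
        shared_loadings.append(lam)
    theta = rng.standard_normal(k_shared)           # response coefficients
    y = eta @ theta + noise_sd * rng.standard_normal(n)
    return views, y, shared_loadings


# Example: three views of different dimension on the same 2000 samples.
n = 2000
views, y, lams = simulate_additive_factor_views(
    n=n, dims=[20, 30, 15], k_shared=3, k_specific=[2, 2, 1]
)

# Under this structure, dependence across views comes only from the shared
# factors, so the empirical cross-covariance between views 0 and 1 should
# be close to Lambda_0 @ Lambda_1.T.
empirical_cross_cov = views[0].T @ views[1] / n
implied_cross_cov = lams[0] @ lams[1].T
agreement = np.corrcoef(empirical_cross_cov.ravel(),
                        implied_cross_cov.ravel())[0, 1]
```

In this sketch the view-specific factors contribute to within-view covariance but not to cross-view covariance, which is the sense in which an additive design can separate the two sources of variation. Inference for such a model (priors, shrinkage, sampling) is not shown here.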
