MFAI: A Scalable Bayesian Matrix Factorization Approach to Leveraging Auxiliary Information

Zhiwei Wang; Fa Zhang; Cong Zheng; Xianghong Hu; Mingxuan Cai; Can Yang

TL;DR

MFAI addresses matrix completion under poor data quality (high sparsity and low signal-to-noise ratio) by integrating gradient boosted trees into probabilistic matrix factorization, which lets the model use abundant auxiliary information nonlinearly. It combines a low-rank factorization $\mathbf{Y} = \mathbf{Z} \mathbf{W}^{\mathrm{T}} + \boldsymbol{\epsilon}$ with nonlinear priors $\mathbf{Z}_{\cdot k} \sim \mathcal{N}(F_k(\mathbf{X}), \beta_k^{-1}\mathbf{I}_N)$, where $F_k(\mathbf{X}) = \sum_{t=1}^{T_k} f_k^{t}(\mathbf{X})$ is a sum of $T_k$ trees fitted to the auxiliary matrix $\mathbf{X}$. Inference uses a variational EM algorithm that alternates between updating the Gaussian posteriors $q(\mathbf{Z})$ and $q(\mathbf{W})$ and performing stage-wise boosting to refine $F(\cdot)$. Missing data are handled under a missing-at-random (MAR) assumption, with surrogate splits for missing covariates. In the reported simulations and real data analyses (MovieLens and brain gene expression), the approach improves imputation accuracy over the compared methods, distinguishes useful auxiliary covariates from irrelevant ones, and scales to large datasets. It is publicly available as the R package mfair. These results suggest practical value for matrix completion and for extracting insights from auxiliary information in diverse domains.
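The model summarized above can be written out entry by entry, which makes the role of each factor explicit. The block below is only a restatement under stated assumptions: the number of factors $K$ and the shapes $\mathbf{Y} \in \mathbb{R}^{N \times M}$, $\mathbf{Z} \in \mathbb{R}^{N \times K}$, $\mathbf{W} \in \mathbb{R}^{M \times K}$ are not given in this summary and are inferred from the $\mathbf{I}_N$ term and from the sample size $N$ and feature size $M$ named in the Figure 3 caption. The noise distribution and the prior on $\mathbf{W}$ are not stated here, so they are left unspecified.

```latex
\begin{align}
Y_{nm} &= \sum_{k=1}^{K} Z_{nk} W_{mk} + \epsilon_{nm},
  & n &= 1,\dots,N,\; m = 1,\dots,M, \\
\mathbf{Z}_{\cdot k} \mid \mathbf{X} &\sim
  \mathcal{N}\!\left(F_k(\mathbf{X}),\, \beta_k^{-1}\mathbf{I}_N\right),
  & k &= 1,\dots,K, \\
F_k(\mathbf{X}) &= \sum_{t=1}^{T_k} f_k^{t}(\mathbf{X}).
\end{align}
```

Read this way, each factor $k$ has its own boosted function $F_k$ and its own precision $\beta_k$, so the auxiliary information can matter for some factors and not for others.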

Abstract

In various practical situations, matrix factorization methods suffer from poor data quality, such as high data sparsity and low signal-to-noise ratio (SNR). Here, we consider a matrix factorization problem that utilizes auxiliary information, which is widely available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose to integrate gradient boosted trees into the probabilistic matrix factorization framework to effectively leverage auxiliary information (MFAI). Thus, MFAI naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships and robustness to irrelevant features and missing values in auxiliary information. The parameters in MFAI can be automatically determined under the empirical Bayes framework, making it adaptive to the utilization of auxiliary information and immune to overfitting. Moreover, MFAI is computationally efficient and scalable to large datasets by exploiting variational inference. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analyses. Our approach is implemented in the R package mfair, available at https://github.com/YangLabHKUST/mfair.

Paper Structure (29 sections, 84 equations, 8 figures, 5 tables, 4 algorithms)

Introduction
Methods
The MFAI Model
Fitting the MFAI Model
Approximate Bayesian Inference
Missing Data
Ranking the Importance of Auxiliary Covariates
The Multi-Factor MFAI Model
Numerical Experiments
Simulation Studies
Imputation Accuracy
Robustness
Computational Efficiency
Real Data Analyses
Data Description and Methods Setup
...and 14 more sections

Figures (8)

Figure 1: Boxplots comparing the accuracy of different methods. Experiment 1 varies the main matrix $\mathbf{Y}$ from weak signal (${\rm PVE} = 0.1$, left) to strong signal (${\rm PVE} = 0.9$, right). Experiment 2 varies the main matrix $\mathbf{Y}$ from low sparsity (${\rm missing \ ratio} = 0$, left) to high sparsity (${\rm missing \ ratio} = 0.9$, right). Accuracy is measured as the difference between each method's RMSE and MFAI's RMSE, divided by MFAI's RMSE, with smaller values indicating higher accuracy. The $y$ axis is plotted on the square-root scale so that the plots are not dominated by poorly performing methods.
Figure 2: Barplots of the importance scores of the auxiliary covariates in Factors 1-3. In the left panel (first three columns), we masked the main matrix $\mathbf{Y}$ randomly, varying from low sparsity (${\rm missing \ ratio} = 0$, left) to high sparsity (${\rm missing \ ratio} = 0.9$, right). In the right panel (next three columns), we first fixed the missing ratio of $\mathbf{Y}$ at $0.5$ and then masked the auxiliary matrix $\mathbf{X}^\text{all}$ randomly, varying from low sparsity (${\rm missing \ ratio} = 0$, left) to high sparsity (${\rm missing \ ratio} = 0.9$, right). The importance scores in each factor have been re-scaled to sum to one. The higher the importance score, the more the specific covariate contributes to improving the model.
Figure 3: Lineplots of computation time against data size. In the left panel, we fixed feature size $M$ and varied sample size $N$. In the right panel, we fixed sample size $N$ and varied feature size $M$. Different point shapes and line colors represent different numbers of auxiliary covariates (i.e., $C$) used in the model.
Figure 4: Boxplots comparing the accuracy of different methods in imputing missing entries. These two sets of experiments vary the main matrix $\mathbf{Y}$ from scarce information ($\text{training ratio} = 0.5$, left) to rich information ($\text{training ratio} = 0.9$, right). Accuracy is measured as the difference between each method's RMSE and MFAI's RMSE, divided by MFAI's RMSE, with smaller values indicating higher accuracy. The $y$ axis is plotted on the square-root scale so that the plots are not dominated by poorly performing methods.
Figure 5: Barplots of the importance scores of the auxiliary covariates in Factors 1-3 of the MovieLens 100K data. In the top left panel, we used only the true movie genre information $\mathbf{X}$ as the input. In the bottom left panel, we used only the re-permuted movie genre information $\mathbf{X}^{\text{pmt}}$ as the input. In the right panel, we used both the true movie genre information $\mathbf{X}$ and the re-permuted movie genre information $\mathbf{X}^{\text{pmt}}$ as the input. The higher the importance score, the more a specific movie genre contributes to improving the model.
...and 3 more figures
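
The accuracy measure described in the captions of Figures 1 and 4 (a method's RMSE minus MFAI's RMSE, divided by MFAI's RMSE) can be sketched in a few lines. This is a minimal illustration only: the function names `rmse` and `relative_rmse_gap` are hypothetical, the numbers are made up, and none of it is taken from the mfair package.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error over paired held-out entries."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse_gap(rmse_method, rmse_reference):
    """(method RMSE - reference RMSE) / reference RMSE; smaller is better."""
    return (rmse_method - rmse_reference) / rmse_reference


# Toy held-out values and predictions, invented for illustration.
held_out = [3.0, 4.0, 5.0]
pred_reference = [3.0, 4.0, 4.0]  # stand-in for the reference method (MFAI)
pred_other = [3.0, 4.0, 3.0]      # stand-in for a competing method

gap = relative_rmse_gap(rmse(held_out, pred_other),
                        rmse(held_out, pred_reference))
print(round(gap, 6))
```

A gap of zero means the method matches the reference, and a positive gap means its RMSE is higher than the reference's by that fraction.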
