bayesNMF: Fast Bayesian Poisson NMF with Automatically Learned Rank Applied to Mutational Signatures

Jenna M. Landy, Nishanth Basava, Giovanni Parmigiani

Abstract

Bayesian Poisson Non-Negative Matrix Factorization (NMF) is widely used to model count data, including in cancer mutational signature analysis. However, standard Gibbs samplers rely on computationally expensive Poisson augmentation, and current software implementations learn the latent rank either through slow and potentially subjective heuristic rank selection or with automatic approaches that do not report posterior uncertainty. In this paper, we introduce bayesNMF, an MH-within-Gibbs sampler to address both of these limitations. First, we define high-overlap proposals for Metropolis-Hastings sampling to remove the need for Poisson augmentation. Second, we define a BIC-based sparsity prior to learn rank automatically within the Bayesian formulation while allowing for posterior uncertainty quantification. We provide an open-source R software package with all of the models and plotting capabilities demonstrated in this paper on GitHub at jennalandy/bayesNMF. Although our applications focus on cancer mutational signatures, our software and results can be extended to any use of Bayesian Poisson NMF.

Paper Structure

This paper contains 72 sections, 41 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 4.1: Illustration of software package capabilities using bayesNMF Poisson-Truncated Normal+MH SBFI on simulated data. A. Posterior diagnostic traceplots. B. Reference assignment using posterior ensemble with majority voting. C. Visualization suite, including similarity heatmaps, contribution summaries, and reconstructed signatures (bar chart of aligned reference, points for final estimates, and error bars for 95% credible intervals).
  • Figure 4.2: Relative performance of Bayesian NMF models. Grey fill represents Gamma priors on standard Poisson models, but Truncated Normal priors on Poisson+MH or Normal models. White fill always indicates Exponential priors. A. Agreement of $\hat{E}$ (left) and $\hat{P}$ (right) between models, measured as signature-wise minimum cosine similarities. For visual clarity, the 0 to 7 values per density with similarity $< 0.95$ are excluded (see Appendix Table C.1). B. Efficiency gain of Poisson+MH relative to standard Poisson. C. Performance of novel Poisson+MH models. Metrics for all models are available in Appendix Figure C.1.
  • Figure 5.1: Rank bias, precision, sensitivity, and time (log scale) of rank learning approaches: bayesNMF with SBFI (upwards triangle) and minBIC (square), as well as SignatureAnalyzer with ARD (downwards triangle), each with Truncated Normal (filled) and Exponential (empty) priors. Medians are plotted as large shapes on top of lighter boxplots and jittered outliers. Full results for all ranks between 1 and 20 are available in Appendix Figure C.3. Sensitivity is the proportion of true signatures for which there is an estimated signature with cosine similarity $> 0.9$. Precision is the proportion of estimated signatures for which there is a true signature with cosine similarity $> 0.9$.
  • Figure 6.1: Results of bayesNMF+SBFI (upwards triangle) and SignatureAnalyzer+ARD (downwards triangle) on PCAWG histology groups. Skin melanoma is excluded for visual clarity because SignatureAnalyzer estimates a high rank of 76 (bayesNMF estimates a rank of 10). A. COSMIC reference signatures aligned to $\hat{P}$. Reference signatures not found with either method are excluded. B. Estimated latent ranks. Exact values in Appendix Table D.1.
  • Figure B.1: Running metrics plot for an example simulated dataset with fixed rank. The vertical blue line indicates convergence (see Section \ref{sec:convergence}) and thus the switch from "accept all" to true MH samples (see Section \ref{sec:warmup}). The highlighted light blue rectangle indicates samples used for final inference.
  • ...and 10 more figures
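
The sensitivity and precision definitions in the Figure 5.1 caption can be illustrated with a minimal Python sketch. This is not code from the bayesNMF R package: the function names, the column-wise layout of the signature matrices, and the toy data are all assumptions made for illustration. Only the cosine similarity threshold of 0.9 comes from the caption.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Pairwise cosine similarity between the columns of A and the columns of B.

    Assumes (hypothetically) that each column is one signature and that no
    column is all zeros.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    A_unit = A / np.linalg.norm(A, axis=0, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=0, keepdims=True)
    return A_unit.T @ B_unit  # shape: (columns of A, columns of B)

def sensitivity_precision(P_true, P_hat, threshold=0.9):
    """Sensitivity and precision as defined in the Figure 5.1 caption."""
    sim = cosine_similarity_matrix(P_true, P_hat)
    # Sensitivity: share of true signatures matched by some estimated signature.
    sensitivity = float(np.mean(sim.max(axis=1) > threshold))
    # Precision: share of estimated signatures matched by some true signature.
    precision = float(np.mean(sim.max(axis=0) > threshold))
    return sensitivity, precision

# Toy data (made up): three true signatures, two estimated signatures.
P_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])
P_hat = np.array([[0.95, 0.0],
                  [0.05, 0.0],
                  [0.00, 0.0],
                  [0.00, 1.0]])

sens, prec = sensitivity_precision(P_true, P_hat)
print(sens, prec)
```

In this toy case only the first estimated signature is close to a true signature, so one of three true signatures is recovered and one of two estimates is a match. Note that these definitions do not require one-to-one matching, so several estimated signatures may match the same true signature.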