Doubly Non-Central Beta Matrix Factorization for Stable Dimensionality Reduction of Bounded Support Matrix Data

Anjali N. Albert, Patrick Flaherty, Aaron Schein

TL;DR

Empirical results show that the method performs comparably to other state-of-the-art approaches in held-out prediction and computational complexity, but is significantly more stable under changes in hyper-parameters.

Abstract

We consider the problem of developing interpretable and computationally efficient matrix decomposition methods for matrices whose entries have bounded support. Such matrices are found in large-scale DNA methylation studies and many other settings. Our approach decomposes the data matrix into a Tucker representation wherein the number of columns in the constituent factor matrices is not constrained. We derive a computationally efficient sampling algorithm to solve for the Tucker decomposition. We evaluate the performance of our method using three criteria: predictability, computability, and stability. Empirical results show that our method performs comparably to other state-of-the-art approaches in held-out prediction and computational complexity, but is significantly more stable under changes in hyper-parameters. The improved stability yields higher confidence in the results for applications where the constituent factors are used to generate and test scientific hypotheses, such as DNA methylation analysis of cancer samples.

Paper Structure

This paper contains 30 sections, 3 theorems, 39 equations, 18 figures, 1 table.

Key Result

Lemma 3

$\mathbb{E}[\beta_{ij}]$ has an analytic closed form, expressed in terms of Kummer's confluent hypergeometric function $\textrm{M}(a,b,c)$ (Buchholz, 2013).

Figures (18)

  • Figure 1: The doubly non-central beta (DNCB) distribution can assume a shape like the standard beta (left). Alternatively, the DNCB distribution can take a multi-modal shape if $\epsilon_1 < 1$ or $\epsilon_2 < 1$ (right). This expressiveness is particularly useful when modeling DNA methylation datasets, which are typically highly dispersed and multi-modal.
  • Figure 2: A graphical comparison of the DNCB-MF and DNCB-TD generative processes. The plate notation represents exchangeability across the specified indices. Shaded nodes are observed variables; unshaded nodes are latent variables. Solid edges denote random variables; dotted edges denote deterministic variables.
  • Figure 3: Prior predictive checks for DNCB-TD on three datasets.
  • Figure 4: Held-out prediction results on three datasets; higher is better. Random test-train splits were generated by creating three binary masks, each holding out a random 10% of the data. Using three random initializations for each model, we fit the models on the training data and imputed the held-out values, varying K across 6 values. We plot the rescaled pointwise predictive density (PPD) obtained by each model; the error bars denote 95% confidence intervals. All three models perform comparably well on held-out prediction.
  • Figure 5: Stability results for BG-NMF, DNCB-MF, and DNCB-TD on bisulfite sequencing methylation data. DNCB-TD is the only model for which the stability of both cluster and pathway assignments remains relatively constant as factor matrix cardinality increases.
  • ...and 13 more figures
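
The shapes described in Figure 1 can be explored by simulation. The sketch below assumes the standard Poisson-mixture representation of the doubly non-central beta distribution: draw two Poisson counts, then draw a beta variate whose shape parameters are shifted by those counts. The names `eps1` and `eps2` mirror $\epsilon_1$ and $\epsilon_2$ in the caption; `lam1`, `lam2`, and both function names are hypothetical labels chosen for illustration and may not match the paper's notation or sampler.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's multiplication method.

    Suitable only for small rates, since it uses exp(-lam) directly.
    """
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dncb(eps1, eps2, lam1, lam2, rng):
    """Draw one value in [0, 1] from an assumed DNCB(eps1, eps2, lam1, lam2).

    Assumption: DNCB is the Poisson mixture
        y1 ~ Poisson(lam1), y2 ~ Poisson(lam2),
        beta ~ Beta(eps1 + y1, eps2 + y2).
    """
    y1 = sample_poisson(lam1, rng)
    y2 = sample_poisson(lam2, rng)
    return rng.betavariate(eps1 + y1, eps2 + y2)


if __name__ == "__main__":
    rng = random.Random(0)
    # Shape parameters below 1, the regime the caption associates with multi-modality.
    draws = [sample_dncb(0.5, 0.5, 2.0, 2.0, rng) for _ in range(10000)]
    # Crude text histogram over ten equal-width bins.
    bins = [0] * 10
    for x in draws:
        bins[min(int(x * 10), 9)] += 1
    for i, n in enumerate(bins):
        print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * (n // 100)}")
```

Setting `lam1 = lam2 = 0` reduces the sampler to an ordinary beta draw, which gives a quick sanity check against the left panel of Figure 1.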

Theorems & Definitions (5)

  • Definition 1: Doubly non-central beta distribution
  • Definition 2: Proportion-sum independence of gammas (Lukacs, 1955)
  • Lemma 3
  • Lemma 4
  • Theorem 5
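
The property named in Definition 2 can be checked numerically. The sketch below relies on the classical result that, for independent gamma variates $X$ and $Y$ with a common scale, the proportion $X/(X+Y)$ is independent of the sum $X+Y$. It estimates the sample correlation between the two quantities, which should be close to zero. Near-zero correlation is only a necessary condition for independence, so this is a sanity check rather than a proof. The function names and parameter values are illustrative and do not come from the paper.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def proportion_sum_correlation(n, seed, shape_x=2.0, shape_y=3.0, scale=1.5):
    """Correlation between X / (X + Y) and X + Y for independent gammas.

    Both variates share the same scale, as the property requires.
    random.gammavariate takes (shape, scale) in that order.
    """
    rng = random.Random(seed)
    proportions, sums = [], []
    for _ in range(n):
        x = rng.gammavariate(shape_x, scale)
        y = rng.gammavariate(shape_y, scale)
        proportions.append(x / (x + y))
        sums.append(x + y)
    return pearson(proportions, sums)


if __name__ == "__main__":
    print(proportion_sum_correlation(100000, seed=0))
```

Giving the two variates different scales breaks the property, and the estimated correlation then moves visibly away from zero.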