
BiSSLB: Binary Spike-and-Slab Lasso Biclustering

Sijian Fan, Ray Bai

Abstract

Biclustering is a powerful unsupervised learning technique for simultaneously identifying coherent subsets of rows and columns in a data matrix, thus revealing local patterns that may not be apparent in global analyses. However, most biclustering methods are developed for continuous data and are not applicable to binary datasets such as single-nucleotide polymorphism (SNP) or protein-protein interaction (PPI) data. Existing biclustering algorithms for binary data often struggle to recover biclustering patterns under noise, face scalability issues, and/or bias the final results towards biclusters of a particular size or characteristic. We propose a Bayesian method for biclustering binary datasets called Binary Spike-and-Slab Lasso Biclustering (BiSSLB). Our method is robust to noise and allows for overlapping biclusters of various sizes without prior knowledge of the noise level or bicluster characteristics. BiSSLB is based on a logistic matrix factorization model with spike-and-slab priors on the latent spaces. We further incorporate an Indian Buffet Process (IBP) prior to automatically determine the number of biclusters from the data. We develop a novel coordinate ascent algorithm with proximal steps that allows for scalable computation. The performance of our proposed approach is assessed through simulations and two real applications on HapMap SNP and Homo sapiens PPI data, where BiSSLB is shown to outperform other state-of-the-art binary biclustering methods when the data is very noisy.
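To make the phrase "logistic matrix factorization" concrete, the following is a minimal sketch of the data-generating side of such a model. It assumes the simplest possible form, in which entry (i, j) of the binary matrix is Bernoulli with success probability sigmoid of the (i, j)th entry of A B^T; the second factor matrix `B`, the absence of intercepts, and all function names are assumptions made for illustration and are not taken from the paper. Sparse rows of `A` and `B` play the role of bicluster memberships.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate_binary_matrix(A, B, rng):
    """Draw Y[i, j] ~ Bernoulli(sigmoid((A @ B.T)[i, j])).

    A (N x K) and B (P x K) are hypothetical latent factor matrices.
    """
    probs = sigmoid(A @ B.T)
    Y = (rng.random(probs.shape) < probs).astype(int)
    return Y, probs

def neg_log_likelihood(Y, A, B):
    """Bernoulli negative log-likelihood under the logit link."""
    logits = A @ B.T
    # log(1 + exp(x)) evaluated stably via logaddexp(0, x)
    return float(np.sum(np.logaddexp(0.0, logits) - Y * logits))

rng = np.random.default_rng(0)
N, P, K = 30, 20, 2
A = np.zeros((N, K))
B = np.zeros((P, K))
# Two non-overlapping blocks: nonzero loadings mark bicluster membership.
A[:10, 0] = 3.0
B[:8, 0] = 3.0
A[15:25, 1] = 3.0
B[10:18, 1] = 3.0

Y, probs = simulate_binary_matrix(A, B, rng)
print(Y.shape, neg_log_likelihood(Y, A, B))
```

In this toy setup, entries inside a block have logit 9 and are almost always 1, while entries outside every block have logit 0 and behave like fair coin flips, which is one simple way noise can enter a binary matrix.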

Paper Structure

This paper contains 19 sections, 1 theorem, 49 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Let $\widehat{\mathbf{A}}$ denote the global mode of the proximal operator BiSSLB-proximal-operator, with $(i,k)$th entry $\widehat{a}_{ik}$, and let $\widetilde{\theta}_{ik}$ be defined as in BiSSLB-thetaik. Then where $\mathbf{Z} = \mathbf{A}^{(t-1)} - \eta \nabla f(\mathbf{A}^{(t-1)})$ with $(i,k)$th entry $z_{ik}$, $\Delta \equiv \inf _{t>0} [t / 2-\eta \operatorname{pen} ( t \mid \widetilde{\
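The gradient step Z = A^(t-1) - eta * grad f(A^(t-1)) in the proposition can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' algorithm: f is taken to be the logistic negative log-likelihood in A with a second factor matrix `B` held fixed (both assumptions), and ordinary soft-thresholding (the lasso proximal operator) is swapped in for the spike-and-slab lasso thresholding rule, whose exact form is not reproduced in this summary. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_f(A, B, Y):
    """Gradient in A of the logistic negative log-likelihood (assumed f)."""
    return (sigmoid(A @ B.T) - Y) @ B

def soft_threshold(Z, tau):
    """Lasso proximal operator; a stand-in for the SSL thresholding rule."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def proximal_step(A_prev, B, Y, eta, lam):
    """One proximal gradient update of A with step size eta."""
    # Gradient step: Z = A^(t-1) - eta * grad f(A^(t-1))
    Z = A_prev - eta * grad_f(A_prev, B, Y)
    # Proximal step: shrink each z_ik toward zero, zeroing small entries
    return soft_threshold(Z, eta * lam), Z

# Tiny usage example with made-up inputs.
Y_demo = np.ones((3, 4))
B_demo = np.ones((4, 2))
A_demo = np.zeros((3, 2))
A_new, Z_demo = proximal_step(A_demo, B_demo, Y_demo, eta=0.1, lam=1.0)
print(Z_demo[0, 0], A_new[0, 0])
```

The two-stage structure (a gradient step producing z_ik, followed by an entrywise thresholding of z_ik) is what the proposition describes; only the thresholding function differs in this sketch.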

Figures (4)

  • Figure 1: Comparison of the different biclustering algorithms in Simulation I (top four panels) and Simulation II (bottom four panels) in terms of consensus error (CE), consensus score (CS), relevance, and recovery. Each of these metrics is plotted against the noise level. The results displayed were averaged across 50 Monte Carlo replicates.
  • Figure 2: Results for recovery of the true bicluster number ($K=15$) in Simulation I (left panel) and Simulation II (right panel) across different noise levels. In these plots, the vertical axis is $\log_{15}\widehat{K}$, where $\widehat{K}$ is the estimated number of biclusters for each algorithm. Thus, $\log_{15}\widehat{K}=1$ indicates perfect recovery of the number of biclusters. The results displayed were averaged across 50 Monte Carlo replicates.
  • Figure 3: Results from fitting BiSSLB to the HapMap data. Left panel: the reordered latent factor matrix $\mathbf{A}$ with the three biclusters that BiSSLB found. Each row represents one sample. Middle panel: the reordered SNP data matrix where the rows correspond to the samples and the columns correspond to the SNP genotypes. A red square indicates a mutant ("1"), and a white square indicates a wild-type ("0"). Right panel: the ancestry information (Blue = Caucasians, White = Asians, Red = Africans).
  • Figure 4: A $200 \times 200$ submatrix of the original PPI data, where black cells indicate observed interactions ("1") and white cells represent non-interactions ("0"). The biclusters estimated by BiSSLB are overlaid in red.
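The vertical-axis quantity in Figure 2, $\log_{15}\widehat{K}$, can be computed as below. This small sketch only illustrates the scale (values above 1 mean more biclusters than the true 15 were estimated, values below 1 mean fewer); the function name is hypothetical, and how the values were averaged over replicates is not specified here.

```python
import math

def log_base_true_k(k_hat, k_true=15):
    """Return log base k_true of k_hat; equals 1 exactly when k_hat == k_true."""
    return math.log(k_hat) / math.log(k_true)

for k_hat in (10, 15, 30):
    print(k_hat, round(log_base_true_k(k_hat), 3))
```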

Theorems & Definitions (1)

  • Proposition 1