Disentangling Shared and Target-Enriched Topics via Background-Contrastive Non-negative Matrix Factorization

Yixuan Li, Archer Y. Yang, Yue Li

TL;DR

We introduce background contrastive Non-negative Matrix Factorization (bcNMF), which extracts target-enriched latent topics by jointly factorizing a target dataset and a matched background using shared non-negative bases under a contrastive objective that suppresses background-expressed structure.

Abstract

Biological signals of interest in high-dimensional data are often masked by dominant variation shared across conditions. This variation, arising from baseline biological structure or technical effects, can prevent standard dimensionality reduction methods from resolving condition-specific structure. The challenge is that these confounding topics are often unknown and mixed with biological signals. Existing background correction methods either do not scale to high dimensions or are not interpretable. We introduce background contrastive Non-negative Matrix Factorization (bcNMF), which extracts target-enriched latent topics by jointly factorizing a target dataset and a matched background using shared non-negative bases under a contrastive objective that suppresses background-expressed structure. This approach yields non-negative components that are directly interpretable at the feature level and explicitly isolates target-specific variation. bcNMF is learned with a multiplicative update algorithm built on matrix multiplications, which makes it efficient on GPU hardware and scalable to large datasets through minibatch training, akin to deep learning approaches. Across simulations and diverse biological datasets, bcNMF reveals signals obscured by conventional methods, including disease-associated programs in postmortem depressive brain single-cell RNA-seq, genotype-linked protein expression patterns in mice, treatment-specific transcriptional changes in leukemia, and TP53-dependent drug responses in cancer cell lines.

Paper Structure (28 sections, 1 theorem, 51 equations, 8 figures)

Key Result

Lemma 1

For any current iterate $H_X^{(t)} > 0$, define $G\bigl(\cdot, H_X^{(t)}\bigr)$ [...]. Then $G\bigl(\cdot, H_X^{(t)}\bigr)$ is a majorization function for $J(\cdot)$ on the non-negative orthant.
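
The descent property that a majorization function implies can be checked numerically. The sketch below is a minimal illustration, assuming plain Euclidean NMF with the classical Lee-Seung multiplicative updates rather than the bcNMF contrastive objective, whose exact form is not given here. The function names (`update_H`, `update_W`, `nmf`) and the `eps` safeguard are illustrative choices, not the authors' implementation.

```python
import numpy as np


def euclidean_loss(X, W, H):
    """Half squared Frobenius reconstruction error of X ~ W @ H."""
    return 0.5 * np.linalg.norm(X - W @ H, "fro") ** 2


def update_H(X, W, H, eps=1e-12):
    """Lee-Seung multiplicative update for H (W held fixed)."""
    return H * (W.T @ X) / (W.T @ W @ H + eps)


def update_W(X, W, H, eps=1e-12):
    """Lee-Seung multiplicative update for W (H held fixed)."""
    return W * (X @ H.T) / (W @ H @ H.T + eps)


def nmf(X, k, n_iter=100, seed=0):
    """Alternate the two updates and record the loss after each sweep."""
    rng = np.random.default_rng(seed)
    # Strictly positive initialization keeps every iterate in the orthant.
    W = rng.random((X.shape[0], k)) + 1e-3
    H = rng.random((k, X.shape[1])) + 1e-3
    losses = [euclidean_loss(X, W, H)]
    for _ in range(n_iter):
        H = update_H(X, W, H)
        W = update_W(X, W, H)
        losses.append(euclidean_loss(X, W, H))
    return W, H, losses


if __name__ == "__main__":
    X = np.random.default_rng(1).random((30, 40))  # synthetic non-negative data
    W, H, losses = nmf(X, k=5)
    # The loss sequence should never increase, up to floating-point noise.
    print(all(b <= a + 1e-8 for a, b in zip(losses, losses[1:])))
```

Each update uses only matrix products and element-wise operations, which is the property the abstract credits for GPU efficiency. Whether the same monotonicity holds for the full contrastive objective depends on the paper's own derivation.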

Figures (8)

  • Figure 1: Schematic overview of the bcNMF framework. The figure illustrates the joint factorization of target and background datasets using bcNMF, with both datasets decomposed into shared non-negative topics. Each sample is represented by its topic activations.
  • Figure 2: Simulation results on blended MNIST-ImageNet data. a Schematic illustrating the construction of the target and background datasets. The target dataset comprises MNIST digits superimposed on ImageNet background patches, while the background dataset contains only natural images processed identically. b UMAP visualization of the coefficient matrix $H_X$ learned from standard NMF (top) and bcNMF (bottom) on the ten-digit dataset. c Interpretability of bcNMF topics. Each column of the basis matrix $W$ is visualized as a $28 \times 28$ image (left), with topic usage for each digit represented as bubble plots (center). The mean reconstructed image for each digit, obtained by combining $W$ with the corresponding average topic vector, is shown on the right. d Quantitative comparison of clustering performance. ARI scores for NMF, cPCA, and bcNMF on the binary and ten-digit tasks, with error bars indicating bootstrap standard errors estimated from repeated subsampling of the test data.
  • Figure 3: Down syndrome-associated protein expression in mice. a UMAP embedding of NMF topic usage for mouse protein expression data, colored by Down syndrome (DS) status. b UMAP embedding of bcNMF topic usage using control mice as background, colored by DS status. c Bar plot of ARI for DS classification using PCA, cPCA, NMF and bcNMF. Error bars indicate bootstrap standard errors estimated from 20 subsamples with replacement (300 target and 200 background samples per subsample). d Top ten proteins for each of the two most significant bcNMF topics. Heatmaps show protein loadings (left) and signed $-\log_{10} p$ values from Welch’s two-sample t-tests comparing DS and non-DS groups (right).
  • Figure 4: bcNMF resolves condition-specific programs in single-cell RNA-seq from a leukemia transplant patient. a UMAP embedding of NMF topic usage ($K$ = 10), colored by condition (pre-transplant, blue; post-transplant, orange). b UMAP embedding of bcNMF topic usage using healthy bone marrow as background, colored by condition. c ARI for transplant status classification using PCA, cPCA, NMF and bcNMF. Error bars denote bootstrap standard errors from 20 stratified subsamples ($n$ = 6,000 cells per subsample). d Top five genes for each of the four most significant bcNMF topics. Heatmaps show gene loadings (left) and signed $-\log_{10} p$ values from Welch’s two-sample t-tests comparing pre- and post-transplant cells (right). e Box plots of topic usage for two bcNMF topics (Topics 8 and 10) in pre- and post-transplant cells; each point represents a single cell.
  • Figure 5: bcNMF isolates TP53-dependent transcriptional responses in the MIX-seq idasanutlin dataset. a UMAP embedding of NMF topic usage ($K$ = 20), colored by cell line identity. b UMAP embedding of NMF topic usage, colored by TP53 mutation status (wild-type, mutant). c UMAP embedding of bcNMF topic usage ($K$ = 20) using wild-type cell lines as background, colored by cell line identity. d UMAP embedding of bcNMF topic usage, colored by TP53 mutation status. e Top five genes for each of the three most significant bcNMF topics. Heatmaps show gene loadings (left) and signed $-\log_{10} p$ values from Welch’s two-sample t-tests comparing TP53 mutant and wild-type cells (right); bottom bars show topic-level significance. f Cell-line-specific genes from bcNMF topics enriched in wild-type clusters, excluding canonical TP53 targets. Heatmaps show gene loadings and Welch’s t-test-based significance scores. g Histogram of cell line composition within wild-type TP53 clusters identified in the bcNMF embedding. h Frequency of genes identified across the three most significant bcNMF topics 4, 8 and 19, with canonical TP53 targets highlighted in red. i Schematic of the TP53 regulatory network linking identified target genes. j ARI for TP53 mutation status alignment using PCA, cPCA, NMF and bcNMF. Bars show ARI computed on the full dataset; error bars indicate bootstrap standard errors estimated from 20 resampled target/background pairs.
  • ...and 3 more figures
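
Several captions above report the adjusted Rand index (ARI) with bootstrap standard errors. As a rough sketch of that evaluation, assuming the cluster labels have already been obtained from the topic usage (the clustering step itself is not described here), the following standard-library code computes ARI and a bootstrap standard error. The function names and the default of 20 resamples are illustrative, with the latter chosen to mirror the captions.

```python
import random
import statistics
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same samples (Hubert-Arabie form)."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivial agreement
    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)


def bootstrap_ari(labels_true, labels_pred, n_boot=20, subsample=None, seed=0):
    """Mean ARI and its bootstrap standard error over resamples."""
    rng = random.Random(seed)
    n = len(labels_true)
    m = subsample or n  # resample size; defaults to the full sample size
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(m)]  # draw with replacement
        scores.append(
            adjusted_rand_index(
                [labels_true[i] for i in idx],
                [labels_pred[i] for i in idx],
            )
        )
    # The standard deviation of the bootstrap scores estimates the SE.
    return statistics.mean(scores), statistics.stdev(scores)
```

For example, `bootstrap_ari(status, clusters, n_boot=20, subsample=300)` would return a mean and standard error for hypothetical lists `status` and `clusters` of equal length.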

Theorems & Definitions (2)

  • Definition 1: Majorization function
  • Lemma 1: Auxiliary function for Euclidean NMF in $H$
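
For orientation, the two items above can be written out in their classical form. This is a sketch of the standard Lee-Seung construction for one column $h$ of $H$ under the plain Euclidean loss $F(h) = \tfrac{1}{2}\lVert x - Wh \rVert^2$; the symbols $F$, $h$, $x$, and $K$ are introduced here for illustration, and the paper's own $J$ and $G$ may include additional contrastive terms.

$$
G(h, h') \ge F(h) \quad \text{for all } h, h' \ge 0, \qquad G(h, h) = F(h),
$$

$$
G\bigl(h, h^{(t)}\bigr) = F\bigl(h^{(t)}\bigr) + \bigl(h - h^{(t)}\bigr)^{\top} \nabla F\bigl(h^{(t)}\bigr) + \tfrac{1}{2} \bigl(h - h^{(t)}\bigr)^{\top} K\bigl(h^{(t)}\bigr) \bigl(h - h^{(t)}\bigr), \qquad K_{ab}\bigl(h^{(t)}\bigr) = \delta_{ab}\, \frac{\bigl(W^{\top} W h^{(t)}\bigr)_a}{h^{(t)}_a}.
$$

Minimizing $G\bigl(\cdot, h^{(t)}\bigr)$ gives the multiplicative update and the descent chain

$$
h^{(t+1)}_a = h^{(t)}_a \, \frac{\bigl(W^{\top} x\bigr)_a}{\bigl(W^{\top} W h^{(t)}\bigr)_a}, \qquad F\bigl(h^{(t+1)}\bigr) \le G\bigl(h^{(t+1)}, h^{(t)}\bigr) \le G\bigl(h^{(t)}, h^{(t)}\bigr) = F\bigl(h^{(t)}\bigr).
$$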