Bipartite Graph Attention-based Clustering for Large-scale scRNA-seq Data

Zhuomin Liang, Liang Bai, Xian Yang

TL;DR

BGFormer addresses the quadratic scalability of Transformer-based clustering on large-scale scRNA-seq data by introducing a small set of learnable anchor tokens and a bipartite graph attention mechanism that couples cells to anchors. This design yields a time complexity of $\mathcal{O}(n m d)$ and memory $\mathcal{O}(n m)$, where $m \ll n$, enabling efficient clustering across datasets with hundreds of thousands of cells. The model optimizes a joint objective comprising anchor reconstruction (via a ZINB model) and DEC-based clustering, augmented by self-supervised and commitment losses, and demonstrates superior clustering accuracy and efficiency compared to a broad range of baselines. Theoretical analysis based on the Johnson-Lindenstrauss lemma supports the approximation of full self-attention by the anchor-based bipartite approach, while experiments on multiple large scRNA-seq datasets show BGFormer’s effectiveness, scalability, and robust embeddings suitable for downstream biological interpretation.
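
As a rough illustration of the DEC-based clustering term mentioned above, the following sketch implements the standard DEC formulation: Student's t soft assignments, a sharpened target distribution, and a KL-divergence loss. Whether BGFormer uses exactly this form is an assumption, and the names `dec_soft_assign`, `dec_target`, `dec_kl_loss`, and the random embeddings and cluster centers are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def dec_soft_assign(Z, centers, alpha=1.0):
    """Student's t soft assignment q_ij between embeddings and cluster centers."""
    # Squared Euclidean distance between every embedding and every center: (n, k)
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target(q):
    """Sharpened target p_ij: square q and normalize by soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)

def dec_kl_loss(q, p, eps=1e-12):
    """KL(P || Q), summed over all cells and clusters."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical data: 200 cell embeddings of dimension 8 and 4 cluster centers.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 8))
centers = rng.normal(size=(4, 8))

Q = dec_soft_assign(Z, centers)
P = dec_target(Q)
loss = dec_kl_loss(Q, P)
```

In DEC-style training the target `P` is treated as fixed while the loss is minimized with respect to the embeddings and centers; the sketch only evaluates the loss and does not perform that optimization.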

Abstract

scRNA-seq clustering is a critical task for analyzing single-cell RNA sequencing (scRNA-seq) data, as it groups cells with similar gene expression profiles. Transformers, as powerful foundational models, have been applied to scRNA-seq clustering. Their self-attention mechanism automatically assigns higher attention weights to cells within the same cluster, enhancing the distinction between clusters. Existing methods for scRNA-seq clustering, such as graph transformer-based models, treat each cell as a token in a sequence. Their computational and space complexities are $\mathcal{O}(n^2)$ with respect to the number of cells, limiting their applicability to large-scale scRNA-seq datasets. To address this challenge, we propose a Bipartite Graph Transformer-based clustering model (BGFormer) for scRNA-seq data. We introduce a set of learnable anchor tokens as shared reference points to represent the entire dataset. A bipartite graph attention mechanism is introduced to learn the similarity between cells and anchor tokens, bringing cells of the same class closer together in the embedding space. BGFormer achieves linear computational complexity with respect to the number of cells, making it scalable to large datasets. Experimental results on multiple large-scale scRNA-seq datasets demonstrate the effectiveness and scalability of BGFormer.
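
To make the linear scaling concrete, here is a minimal sketch of cell-to-anchor attention, assuming a single head and plain linear projections. It is not the authors' implementation: the function `bipartite_attention`, the projection matrices `Wq`, `Wk`, `Wv`, and all sizes are hypothetical, and the anchors are fixed random vectors here rather than learned tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bipartite_attention(X, anchors, Wq, Wk, Wv):
    """Cells attend to anchors only, so the score matrix is (n, m), not (n, n)."""
    Q = X @ Wq                                # cell queries, shape (n, d)
    K = anchors @ Wk                          # anchor keys, shape (m, d)
    V = anchors @ Wv                          # anchor values, shape (m, d)
    scores = Q @ K.T / np.sqrt(Q.shape[1])    # (n, m): O(n m d) time, O(n m) memory
    A = softmax(scores, axis=1)               # each cell's weights over the anchors
    return A @ V, A                           # cell embeddings (n, d), attention (n, m)

# Hypothetical sizes: n cells, m anchors, g genes, d embedding dimensions.
rng = np.random.default_rng(0)
n, m, g, d = 1000, 16, 50, 8
X = rng.normal(size=(n, g))
anchors = rng.normal(size=(m, g))
Wq, Wk, Wv = (rng.normal(size=(g, d)) * 0.1 for _ in range(3))

Z, A = bipartite_attention(X, anchors, Wq, Wk, Wv)
```

For $n = 100{,}000$ cells and an assumed $m = 256$ anchors, the score matrix holds $2.56 \times 10^{7}$ entries, compared with $10^{10}$ for full cell-to-cell self-attention.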

Paper Structure

This paper contains 21 sections, 2 theorems, 21 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Theorem 5.1

For any $\bm{Q}_b \in \mathbb{R}^{n' \times d}$ and $\bm{K}, \bm{V} \in \mathbb{R}^{n \times d}$, and for any column vector $\bm{\omega} \in \mathbb{R}^{n}$ of the matrix $\bm{V}$, there exists a low-rank matrix $\tilde{\bm{A}}_b \in \mathbb{R}^{n' \times n}$ such that [...], where $\bm{A}_b = \mathrm{softmax}(\bm{Q}_b \bm{K}^\top / \sqrt{d_k})$ is the attention matrix, $n'$ is the number of cells in a batch, $\epsilon > 0$ ...

Figures (7)

  • Figure 1: Comparison of different clustering methods.
  • Figure 2: Comparison of the attention mechanism in traditional Transformer and BGFormer-based clustering model.
  • Figure 3: The framework of BGFormer-based clustering model.
  • Figure 4: UMAP visualizations of cell embeddings, with colors indicating cell types.
  • Figure 5: Dot plots showing the expression of genes on the Bach dataset.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 5.1
  • Theorem 5.2