Table of Contents
Fetching ...

Single-cell Curriculum Learning-based Deep Graph Embedding Clustering

Huifa Li, Jie Fu, Xinpeng Ling, Zhiyu Sun, Kuncan Wang, Zhili Chen

TL;DR

A single-cell curriculum learning-based deep graph embedding clustering (scCLG) that combines three optimization objectives, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation is proposed.

Abstract

The swift advancement of single-cell RNA sequencing (scRNA-seq) technologies enables the investigation of cellular-level tissue heterogeneity. Cell annotation significantly contributes to the extensive downstream analysis of scRNA-seq data. However, The analysis of scRNA-seq for biological inference presents challenges owing to its intricate and indeterminate data distribution, characterized by a substantial volume and a high frequency of dropout events. Furthermore, the quality of training samples varies greatly, and the performance of the popular scRNA-seq data clustering solution GNN could be harmed by two types of low-quality training nodes: 1) nodes on the boundary; 2) nodes that contribute little additional information to the graph. To address these problems, we propose a single-cell curriculum learning-based deep graph embedding clustering (scCLG). We first propose a Chebyshev graph convolutional autoencoder with multi-criteria (ChebAE) that combines three optimization objectives, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation. Meanwhile, we employ a selective training strategy to train GNN based on the features and entropy of nodes and prune the difficult nodes based on the difficulty scores to keep the high-quality graph. Empirical results on a variety of gene expression datasets show that our model outperforms state-of-the-art methods. The code of scCLG will be made publicly available at https://github.com/LFD-byte/scCLG.

Single-cell Curriculum Learning-based Deep Graph Embedding Clustering

TL;DR

A single-cell curriculum learning-based deep graph embedding clustering (scCLG) that combines three optimization objectives, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation is proposed.

Abstract

The swift advancement of single-cell RNA sequencing (scRNA-seq) technologies enables the investigation of cellular-level tissue heterogeneity. Cell annotation significantly contributes to the extensive downstream analysis of scRNA-seq data. However, The analysis of scRNA-seq for biological inference presents challenges owing to its intricate and indeterminate data distribution, characterized by a substantial volume and a high frequency of dropout events. Furthermore, the quality of training samples varies greatly, and the performance of the popular scRNA-seq data clustering solution GNN could be harmed by two types of low-quality training nodes: 1) nodes on the boundary; 2) nodes that contribute little additional information to the graph. To address these problems, we propose a single-cell curriculum learning-based deep graph embedding clustering (scCLG). We first propose a Chebyshev graph convolutional autoencoder with multi-criteria (ChebAE) that combines three optimization objectives, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation. Meanwhile, we employ a selective training strategy to train GNN based on the features and entropy of nodes and prune the difficult nodes based on the difficulty scores to keep the high-quality graph. Empirical results on a variety of gene expression datasets show that our model outperforms state-of-the-art methods. The code of scCLG will be made publicly available at https://github.com/LFD-byte/scCLG.
Paper Structure (23 sections, 16 equations, 4 figures, 3 tables, 1 algorithm)

This paper contains 23 sections, 16 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Framework of scCLG. (A) Pre-training: pretraining the proposed ChebAE with adjacency matrix decoder and ZINB decoder. Then calculate node difficulty using a hierarchical difficulty measurer and prune the data. (B) Formal training: using all three criterias to optimize the model in more detail from easy to hard pattern with pruned data.
  • Figure 2: The model architecture of multi-criteria ChebAE. ChebAE integrates three loss components: reconstruction loss, ZINB loss, and a clustering loss to optimize the low-dimensional latent representation.
  • Figure 3: Parameter analysis. (A) Comparison of the average ARI and NMI values with different neighbor parameters $k$. (B) Comparison of the average ARI and NMI values with different numbers of genes.
  • Figure 4: Comparison of the average ARI and NMI values with different data pruning rates and pruning strategies.