Hierarchical novel class discovery for single-cell transcriptomic profiles

Malek Senoussi; Thierry Artières; Paul Villoutreix

TL;DR

The paper tackles hierarchical Novel Class Discovery (NCD) in single-cell transcriptomics, where the label sets of the labeled and unlabeled data are disjoint. It introduces two hierarchy-aware clustering models, h-k-means and h-GMM, each augmented with a hierarchical continuity loss that encourages smooth transitions of the cluster means along a cell lineage tree. Across simulated and experimental scRNA-seq datasets, the hierarchical methods generally improve clustering accuracy and approach an empirical upper bound, outperforming non-hierarchical baselines and remaining competitive with Autonovel in several settings. The results indicate that incorporating lineage structure into NCD for developmental data improves both clustering and label mapping, with implications for scalable annotation in large-scale single-cell studies.

Abstract

One of the major challenges arising from single-cell transcriptomics experiments is how to annotate the resulting single-cell transcriptomic profiles. Because the data are large and high-dimensional, automated annotation methods are needed. We focus here on datasets obtained in the context of developmental biology, where the differentiation process leads to a hierarchical structure. We consider a frequent setting where both labeled and unlabeled data are available at training time, but the set of labels of the labeled data and the set of labels of the unlabeled data are disjoint. This is an instance of the Novel Class Discovery problem. The goal is twofold: clustering the data and mapping the clusters to labels. We propose extensions of the k-means and GMM clustering methods that take advantage of the hierarchical nature of the data, and report comparative results on artificial and experimental transcriptomic datasets.
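
The disjoint-label setting described above can be illustrated with a minimal sketch. It is not the authors' code: the helper name `split_for_ncd`, the toy class names, and the random data are assumptions made purely for illustration. It shows how a fully annotated dataset could be split so that the labeled and unlabeled parts share no class.

```python
import numpy as np

def split_for_ncd(X, y, known_classes):
    """Split (X, y) so that labeled and unlabeled samples have disjoint classes.

    Hypothetical helper: samples whose class is in `known_classes` keep their
    label; all other samples are treated as unlabeled (novel classes).
    """
    y = np.asarray(y)
    labeled_mask = np.isin(y, list(known_classes))
    X_lab, y_lab = X[labeled_mask], y[labeled_mask]
    X_unl = X[~labeled_mask]
    # Hidden labels are kept only to evaluate clustering afterwards.
    y_unl_hidden = y[~labeled_mask]
    return X_lab, y_lab, X_unl, y_unl_hidden

# Toy data: 12 "cells", 5 "genes", 4 classes (all values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
y = np.repeat(["A", "B", "C", "D"], 3)

X_lab, y_lab, X_unl, y_unl_hidden = split_for_ncd(X, y, known_classes={"A", "B"})
```

In this toy split, classes A and B play the role of the labeled classes, while C and D are the novel classes that a clustering method would have to discover and map to labels.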

Paper Structure (14 sections, 6 equations, 1 figure, 3 tables)

Introduction
Related works
Method
Problem formalization
Hierarchical continuity loss
Hierarchical k-means (h-k-means)
Hierarchical Gaussian Mixture Model (h-GMM)
Datasets
Simulated Datasets
Experimental Datasets
Experiments
Metrics
Results
Discussion

Figures (1)

Figure 1: Hierarchical Novel Class Discovery problem. A) The data under consideration are represented by a single-cell RNA sequencing (scRNA-seq) matrix on the left, where each row is a transcriptomic vector that belongs to a class, and the classes are organized in a hierarchy (lineage tree). Part of the data is labeled (colored according to class); the other part is unlabeled (in gray). B) The left panel shows the ground-truth distribution of the data (one color per class). The right panel shows the supervision available for training the model: a subset of the classes contains all the labeled data, while the remaining classes contain only unlabeled data, which are plotted in black.
