Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning

Da Kuang, Guanwen Qiu, Junhyong Kim

TL;DR

This work presents CellTreeQM, a transformer-based framework that reframes cell lineage reconstruction as a metric-learning problem to recover lineage trees from high-dimensional phenotypic data. By enforcing tree-metric properties via a quartet-based additivity loss, incorporating feature gating, and using a deviation regularizer, the method yields embeddings that enable accurate lineage inference even with limited supervision. The authors establish a Lineage Reconstruction Benchmark and demonstrate superior performance over standard contrastive losses across supervised, weakly supervised, and unsupervised settings on synthetic data and lineage-resolved C. elegans datasets. The approach offers a scalable, data-efficient pathway to uncover molecular dynamics of lineage decisions in challenging organisms, with practical implications for developmental biology and regenerative medicine.

Abstract

How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells grow, divide, and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell lineage division and differentiation histories, providing an analytical framework for dissecting individual cells' molecular decisions during replication and differentiation. Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. In contrast, modern single-cell technologies can measure high-content molecular profiles (e.g., transcriptomes) in a wide range of biological systems. Here, we introduce CellTreeQM, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating lineage reconstruction as a tree-metric learning problem, we have systematically explored supervised, weakly supervised, and unsupervised training settings and present a Lineage Reconstruction Benchmark to facilitate comprehensive evaluation of our learning method. We benchmarked the method on (1) synthetic data modeled via Brownian motion with independent noise and spurious signals and (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships in challenging animal models. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage.

Paper Structure

This paper contains 106 sections, 3 theorems, 38 equations, 24 figures, 14 tables.

Key Result

Lemma 4.1

If $\{\mathbf{x}_i\}$ arise from a continuous-time Markov process on a tree $T$ with independent Gaussian increments along each edge (e.g., Brownian motion), then for any two leaves $i$ and $j$, $$\mathbb{E}\left[\|\mathbf{x}_i - \mathbf{x}_j\|^2\right] \propto D_{T}(i, j),$$ where $D_{T}(i, j)$ is an additive (tree) distance reflecting the unique path between $i$ and $j$.
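
For the Brownian-motion case named in the lemma, the additivity can be checked numerically. The sketch below is an illustration, not the paper's code: the four-leaf tree, its edge lengths, and the helper names `simulate_leaves` and `scaled_mean_sq_dist` are all assumptions made for this example. It relies on the standard fact that Brownian increments along an edge of length $t$ have variance $\sigma^2 t$ per dimension, so the mean squared Euclidean distance between two leaves, divided by the dimension and $\sigma^2$, should approach the path length between them.

```python
import itertools
import numpy as np

# Hypothetical edge lengths for the rooted tree ((A,B),(C,D)); not from the paper.
EDGES = {"root-u": 1.0, "root-v": 2.0, "u-A": 1.0, "u-B": 3.0, "v-C": 2.0, "v-D": 1.0}

# Path length between each pair of leaves, summed along the unique connecting path.
TREE_DIST = {
    ("A", "B"): 4.0,  # u-A + u-B
    ("A", "C"): 6.0,  # u-A + root-u + root-v + v-C
    ("A", "D"): 5.0,  # u-A + root-u + root-v + v-D
    ("B", "C"): 8.0,  # u-B + root-u + root-v + v-C
    ("B", "D"): 7.0,  # u-B + root-u + root-v + v-D
    ("C", "D"): 3.0,  # v-C + v-D
}


def simulate_leaves(rng, n_rep, dim, sigma2=1.0):
    """Run n_rep independent Brownian motions down the tree; return leaf states."""
    def step(parent, edge):
        # Gaussian increment whose variance is proportional to the edge length.
        scale = np.sqrt(sigma2 * EDGES[edge])
        return parent + rng.normal(0.0, scale, size=parent.shape)

    root = np.zeros((n_rep, dim))
    u, v = step(root, "root-u"), step(root, "root-v")
    return {"A": step(u, "u-A"), "B": step(u, "u-B"),
            "C": step(v, "v-C"), "D": step(v, "v-D")}


def scaled_mean_sq_dist(leaves, dim, sigma2=1.0):
    """Mean squared Euclidean distance per pair, divided by dim * sigma2."""
    out = {}
    for i, j in itertools.combinations(sorted(leaves), 2):
        sq = np.sum((leaves[i] - leaves[j]) ** 2, axis=1)
        out[(i, j)] = sq.mean() / (dim * sigma2)
    return out


rng = np.random.default_rng(0)
leaves = simulate_leaves(rng, n_rep=20000, dim=50)
estimates = scaled_mean_sq_dist(leaves, dim=50)
for pair, est in estimates.items():
    print(pair, round(est, 2), "tree distance:", TREE_DIST[pair])
```

Each printed estimate should sit close to the corresponding tree distance, with the gap shrinking as `n_rep` grows. Note that it is the squared Euclidean distance, not the distance itself, that is additive in expectation under this model.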

Figures (24)

  • Figure 1: Exploring phenotype-based cell lineage reconstruction. This figure highlights the focus of our study on reconstructing cell lineage trees using phenotype data, specifically gene expression profiles (right panel), in contrast to traditional methods that rely on genotype data, such as DNA sequences (left panel).
  • Figure 2: Overview of the CellTreeQM workflow for lineage reconstruction using Metric Learning. When the full tree is known as prior knowledge, this is a supervised setting. When no prior information about the tree is available, the setting is unsupervised. In between, we highlight two weakly supervised settings: the High-level Partitioning Setting, where only high-level groupings are available, and the Partially Leaf-labeled Setting, where topological labels are provided for a subset of leaves.
  • Figure 3: Geometric intuition of the quartet loss. (a-c) show the three possible unrooted tree topologies for a quartet $\{A,B,C,D\}$; (d) depicts the "box-like" distortion that arises if the distances are not perfectly additive.
  • Figure 4: Supervised training dynamics on the simulation and C. elegans Small dataset. (a) For the simulated dataset, the dashed pink line represents the RF distance of the tree reconstructed from the raw data, while the dashed black line indicates the RF distance using only the "signal" features. The dashed purple line shows the average Pagel's $\lambda$ of the selected features, serving as a benchmark for phylogenetic signal strength. (b) The feature selection process gradually excludes noise features (blue), while the number of selected signal features (green) stabilizes. The difference in gating values (dashed red) decreases over time, indicating that feature selection is converging. (c-d) Training on the C. elegans Small dataset: (c) RF distance and (d) the number of selected features exhibit trends similar to those observed in the simulation dataset.
  • Figure 5: Training dynamics of CellTreeQM in a purely unsupervised setting on a simulated dataset. "Optimal" is the RF distance of the tree reconstructed from the signal features only.
  • ...and 19 more figures
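
The captions for Figures 4 and 5 report reconstruction quality as RF (Robinson-Foulds) distance. As a minimal sketch of that metric, assuming trees are written as nested tuples of leaf names (a representation chosen for this example, not taken from the paper), the unnormalized RF distance counts the non-trivial leaf bipartitions present in one tree but not the other. The function names `leaf_set`, `splits`, and `rf_distance` are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # reference leaf used to pick a canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        # Store the side of the bipartition that does not contain `ref`.
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (empty, single leaf, or all but one leaf).
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Unnormalized RF distance: size of the symmetric difference of splits."""
    return len(splits(tree_a) ^ splits(tree_b))


true_tree = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(true_tree, true_tree))
print(rf_distance(true_tree, estimate))
```

The two example trees share the split separating `D` and `E` from the rest but disagree on how `A`, `B`, and `C` are grouped, so each contributes one split the other lacks. This sketch treats rooted input as unrooted and assumes both trees carry the same leaf set.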

Theorems & Definitions (4)

  • Lemma 4.1: Additivity of Expected Distances, Informal
  • Theorem A.1: 4-Point Condition
  • Lemma B.1: Additivity of Expected Distances
  • proof
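
The 4-point condition listed above can be illustrated on a single quartet. In its standard form, a distance on four points $\{A,B,C,D\}$ is additive (fits a tree) exactly when, among the three pairwise sums $d_{AB}+d_{CD}$, $d_{AC}+d_{BD}$, and $d_{AD}+d_{BC}$, the two largest are equal. The sketch below is an assumption-laden illustration and is not the paper's quartet loss: the distances are made up, and `quartet_sums`, `four_point_gap`, and `inferred_topology` are hypothetical helper names.

```python
def quartet_sums(d):
    """Three pairwise sums, one per unrooted topology of the quartet."""
    return {
        "AB|CD": d["A", "B"] + d["C", "D"],
        "AC|BD": d["A", "C"] + d["B", "D"],
        "AD|BC": d["A", "D"] + d["B", "C"],
    }


def four_point_gap(d):
    """Difference between the two largest sums; zero for an additive metric."""
    ordered = sorted(quartet_sums(d).values())
    return ordered[2] - ordered[1]


def inferred_topology(d):
    """The pairing with the smallest sum groups the true sibling pairs."""
    sums = quartet_sums(d)
    return min(sums, key=sums.get)


# Made-up additive distances consistent with the topology AB|CD.
additive = {("A", "B"): 4.0, ("C", "D"): 3.0, ("A", "C"): 6.0,
            ("B", "D"): 7.0, ("A", "D"): 5.0, ("B", "C"): 8.0}

# Same distances with one entry perturbed, which breaks additivity.
distorted = dict(additive)
distorted["B", "C"] = 9.0

print(inferred_topology(additive), four_point_gap(additive))
print(inferred_topology(distorted), four_point_gap(distorted))
```

A non-zero gap corresponds to the "box-like" distortion described for Figure 3(d): the distances still favor one topology, but no tree reproduces them exactly.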