OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling

Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Lars Schimmelpfennig, Levi Kaster, Di Huang, Carlos Cruchaga, Guangfu Li, Michael Province, Yixin Chen, Philip Payne, Fuhai Li

TL;DR

OmniCellTOSG introduces the first Text-Omic Signaling Graph (TOSG) dataset to jointly leverage large language models (LLMs) and graph neural networks (GNNs) for decoding complex cell signaling. It integrates human-readable annotations (functions, localizations, diseases, drugs) with quantitative gene/protein abundances, built from approximately 120 million single cells across diverse tissues and conditions, and distilled into meta-cells via SEACells. The authors detail a full pipeline—from multi-source data collection and preprocessing to entity matching, TOSG construction, and a PyTorch-friendly data package—culminating in a joint LLM-GNN foundation model that masks edges and fuses omic and textual features through bi-encoders and cross-modality fusion. Experiments on disease-relevant benchmarks show that the proposed CellTOSG foundation model outperforms standard GNN baselines, highlighting the value of integrating textual prior knowledge with numeric omics for cell signaling inference and potential precision medicine applications.
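The edge-masking idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_edges`, the `mask_ratio` parameter, and the toy edge list are all assumptions made for illustration. It shows only the general pattern of hiding a random subset of edges so that a model can be trained to recover them from the remaining graph.

```python
import random

def mask_edges(edges, mask_ratio=0.2, seed=0):
    """Split an edge list into (visible, masked) subsets.

    Hypothetical helper: a model would see only `visible` edges and be
    trained to reconstruct the `masked` ones.
    """
    rng = random.Random(seed)          # local RNG keeps the split reproducible
    shuffled = list(edges)             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_masked = int(len(shuffled) * mask_ratio)
    masked = shuffled[:n_masked]
    visible = shuffled[n_masked:]
    return visible, masked

# Toy graph with five edges between entity indices (illustrative only).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
visible, masked = mask_edges(edges, mask_ratio=0.4)
```

With five edges and a ratio of 0.4, two edges are held out and three remain visible; which two depends on the seed.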

Abstract

Complex cell signaling systems -- governed by varying protein abundances and interactions -- generate diverse cell types across organs. These systems evolve under influences such as age, sex, diet, environmental exposures, and diseases, making them challenging to decode given the involvement of tens of thousands of genes and proteins. Recently, single-cell omics data covering hundreds of millions of cells have provided a robust foundation for understanding these signaling networks within various cell subpopulations and conditions. Inspired by the success of large foundation models (for example, large language models and large vision models) pre-trained on massive datasets, we introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual cell or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. OmniCellTOSG offers two key contributions. First, it introduces a novel graph model that integrates human-readable annotations -- such as biological functions, cellular locations, signaling pathways, related diseases, and drugs -- with quantitative gene and protein abundance data, enabling graph reasoning to decode cell signaling. This approach calls for new joint models combining large language models and graph neural networks. Second, the dataset is built from single-cell RNA sequencing data of approximately 120 million cells from diverse tissues and conditions (healthy and diseased) and is fully compatible with PyTorch. This facilitates the development of innovative cell signaling models that could transform research in life sciences, healthcare, and precision medicine. The OmniCellTOSG dataset is continuously expanding and will be updated regularly. The dataset and code are available at https://github.com/FuhaiLiAiLab/OmniCellTOSG.
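To make the description of a TOSG concrete, the following is a minimal sketch of how one graph might be represented. It is not the data format shipped with OmniCellTOSG: the class names `Entity` and `TextOmicGraph`, their fields, and every value in the example (entity names, annotations, abundances, labels) are hypothetical placeholders. The point is only that each node carries both a textual annotation and a numeric abundance, and each graph carries sample-level labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Entity:
    name: str          # gene or protein identifier (placeholder here)
    kind: str          # e.g. "transcript" or "protein"
    annotation: str    # human-readable text: function, location, diseases, drugs
    abundance: float   # quantitative omic value for this cell or meta-cell

@dataclass
class TextOmicGraph:
    entities: List[Entity]
    edges: List[Tuple[int, int]]                 # (source index, target index)
    labels: Dict[str, str] = field(default_factory=dict)  # organ, disease, sex, ...

    def omic_features(self) -> List[float]:
        """Numeric view of the nodes, the kind of input a GNN would consume."""
        return [e.abundance for e in self.entities]

    def text_features(self) -> List[str]:
        """Textual view of the nodes, the kind of input an LLM encoder would consume."""
        return [e.annotation for e in self.entities]

# All values below are made-up placeholders, not data from the paper.
graph = TextOmicGraph(
    entities=[
        Entity("GENE_A", "transcript", "placeholder annotation for GENE_A", 1.5),
        Entity("PROT_A", "protein", "placeholder annotation for PROT_A", 0.0),
    ],
    edges=[(0, 1)],
    labels={"organ": "brain", "disease": "placeholder"},
)
```

Keeping the two views as separate accessors mirrors the split between an omic encoder and a text encoder, with fusion happening downstream of both.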

Paper Structure

This paper contains 31 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of text-omic signaling graph (TOSG) generation. (a) Millions of single cells collected from multiple tissues, diseases, and cell types. (b) The values in the collected h5ad files for those $N_0$ single cells. (c) Archetypal analysis to aggregate $N_0$ cells into $N$ meta-cells. (d-e) Integrating transcript entities into a text-omic signaling network with $M$ ($M=M_t+M_p$) matched entities by retrieving the knowledge base from BioMedGraphica. (f-g) Generation of the text-omic signaling graphs for the matched and virtual entities. (h) Joint text encoder and omic encoder with cross-modality fusion. (i-j) Message propagation on the generated text-omic signaling graphs, encapsulating the fused biological and textual information for foundation model training and downstream tasks.
  • Figure 2: Observation of Meta-Cell Gene Expression Distributions and Clustering Patterns. (a) Circular visualization of differential gene expression between Alzheimer's Disease (AD) and normal brain samples. The concentric rings represent: (I) Gene expression profiles in individual cells, with the outer three rings corresponding to AD samples and the inner three rings to normal samples, randomly selected from the dataset; (II) P-values derived from a t-test comparing AD and normal cells, with the red line indicating the $p < 0.05$ significance threshold; and (III–IV) Mean gene expression levels for AD and normal groups, respectively. (b) UMAP visualization of meta-cell clustering results for brain and bone marrow tissues. The first column presents AD and corresponding normal samples from the brain, while the second column shows Acute Myeloid Leukemia and normal samples from the bone marrow. Each color represents a cluster corresponding to a distinct cell type, with black circles indicating clusters consolidated into a single meta-cell.
  • Figure 3: Overview of the filtered dataset, highlighting diseased cells from various organ groups after excluding normal cells and brain cells due to their high abundance. Each colored segment (G1 to G10) represents a distinct organ category, with numeric labels indicating the total number of cells retained in each group.
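Panel (c) of Figure 1 aggregates $N_0$ cells into $N$ meta-cells. The paper does this with SEACells, which relies on archetypal analysis. The sketch below does not implement SEACells: it assumes the cell-to-meta-cell assignment is already given and substitutes a plain per-gene mean, purely to show the shape of the reduction from an $N_0 \times G$ matrix to an $N \times G$ one. The function name `aggregate_meta_cells` and the toy values are hypothetical.

```python
from collections import defaultdict

def aggregate_meta_cells(expression, assignment):
    """Average the expression vectors of cells that share a meta-cell id.

    expression: list of per-cell expression vectors (N_0 rows, G genes each)
    assignment: list of meta-cell ids, one per cell (assumed precomputed)
    Returns a dict mapping meta-cell id -> mean expression vector.
    """
    groups = defaultdict(list)
    for cell_vector, meta_id in zip(expression, assignment):
        groups[meta_id].append(cell_vector)

    meta = {}
    for meta_id, vectors in groups.items():
        n = len(vectors)
        # zip(*vectors) iterates gene by gene across the grouped cells
        meta[meta_id] = [sum(column) / n for column in zip(*vectors)]
    return meta

# Three toy cells with two genes each; the first two share meta-cell 0.
expression = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
assignment = [0, 0, 1]
meta_cells = aggregate_meta_cells(expression, assignment)
```

Here three cells collapse into two meta-cells, each of which would then serve as one graph's worth of abundance values.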