OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling
Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Lars Schimmelpfennig, Levi Kaster, Di Huang, Carlos Cruchaga, Guangfu Li, Michael Province, Yixin Chen, Philip Payne, Fuhai Li
TL;DR
OmniCellTOSG introduces the first Text-Omic Signaling Graph (TOSG) dataset, designed so that large language models (LLMs) and graph neural networks (GNNs) can be used jointly to decode complex cell signaling. It integrates human-readable annotations (functions, localizations, diseases, drugs) with quantitative gene/protein abundances. The dataset is built from approximately 120 million single cells across diverse tissues and conditions, which are distilled into meta-cells via SEACells. The authors detail a full pipeline: multi-source data collection and preprocessing, entity matching, TOSG construction, and a PyTorch-friendly data package. The pipeline culminates in a joint LLM-GNN foundation model that masks edges and combines omic and textual features through bi-encoders and cross-modality fusion. Experiments on disease-relevant benchmarks show that the proposed CellTOSG foundation model outperforms standard GNN baselines, highlighting the value of integrating textual prior knowledge with numeric omics for cell signaling inference and potential precision medicine applications.
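The edge masking and feature fusion mentioned in the TL;DR can be illustrated with a toy sketch. This is not the authors' implementation: the function names `mask_edges` and `fuse_features`, the masking ratio, and the example gene pairs are assumptions chosen for illustration, and plain concatenation stands in for the bi-encoder and cross-modality fusion modules, which the summary does not specify.

```python
import random


def mask_edges(edges, mask_ratio=0.25, seed=0):
    """Split edges into (visible, masked) lists.

    A model would see only the visible edges and be trained to
    recover the masked ones (hypothetical pretraining setup).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)          # seeded for reproducibility
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_masked = int(round(len(shuffled) * mask_ratio))
    return shuffled[n_masked:], shuffled[:n_masked]


def fuse_features(omic_vec, text_vec):
    """Toy fusion: concatenate an omic vector with a text embedding.

    A stand-in only; the real model is described as using
    bi-encoders and cross-modality fusion.
    """
    return list(omic_vec) + list(text_vec)


# Illustrative signaling edges (source gene, target gene).
edges = [
    ("EGFR", "KRAS"),
    ("KRAS", "BRAF"),
    ("BRAF", "MAP2K1"),
    ("MAP2K1", "MAPK1"),
]
visible, masked = mask_edges(edges, mask_ratio=0.25)

# One abundance value fused with a made-up 2-dimensional text embedding.
fused = fuse_features([1.0], [0.5, 0.2])
```

With four edges and a ratio of 0.25, one edge is held out and three remain visible; which edge is masked depends on the seed.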
Abstract
Complex cell signaling systems, governed by varying protein abundances and interactions, generate diverse cell types across organs. These systems evolve under influences such as age, sex, diet, environmental exposures, and diseases, making them challenging to decode given the involvement of tens of thousands of genes and proteins. Recently, single-cell omics data from hundreds of millions of cells have provided a robust foundation for understanding these signaling networks within various cell subpopulations and conditions. Inspired by the success of large foundation models (for example, large language models and large vision models) pre-trained on massive datasets, we introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual cell or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. OmniCellTOSG offers two key contributions. First, it introduces a novel graph model that integrates human-readable annotations (such as biological functions, cellular locations, signaling pathways, related diseases, and drugs) with quantitative gene and protein abundance data, enabling graph reasoning to decode cell signaling. This approach calls for new joint models combining large language models and graph neural networks. Second, the dataset is built from single-cell RNA sequencing data of approximately 120 million cells from diverse tissues and conditions (healthy and diseased) and is fully compatible with PyTorch. This facilitates the development of innovative cell signaling models that could transform research in life sciences, healthcare, and precision medicine. The OmniCellTOSG dataset is continuously expanding and will be updated regularly. The dataset and code are available at https://github.com/FuhaiLiAiLab/OmniCellTOSG.
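To make the idea of a text-omic signaling graph concrete, the following minimal sketch shows one way such a graph could be held in memory: each node carries both a numeric abundance and a text annotation, and the graph carries sample-level labels. The class names `Node` and `TOSG`, the field names, and all values are hypothetical and are not the data format of the released package.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str          # gene or protein identifier
    abundance: float   # quantitative omic value (made-up numbers below)
    annotation: str    # human-readable text description


@dataclass
class TOSG:
    nodes: List[Node]
    edges: List[Tuple[int, int]]                      # (source index, target index)
    labels: Dict[str, str] = field(default_factory=dict)

    def downstream(self, idx: int) -> List[str]:
        """Names of nodes that node `idx` signals to."""
        return [self.nodes[t].name for s, t in self.edges if s == idx]


# A two-node toy graph with illustrative values.
graph = TOSG(
    nodes=[
        Node("EGFR", 2.4, "Receptor tyrosine kinase at the cell membrane."),
        Node("KRAS", 1.1, "Small GTPase relaying growth signals."),
    ],
    edges=[(0, 1)],
    labels={"organ": "lung", "disease": "healthy"},
)

targets_of_egfr = graph.downstream(0)
```

Keeping text and abundance on the same node is what would allow a language model to encode the annotation while a graph network propagates the numeric signal over the same edges.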
