GraphSeqLM: A Unified Graph Language Framework for Omic Graph Learning

Heming Zhang, Di Huang, Yixin Chen, Fuhai Li

TL;DR

The paper tackles the challenge of noisy, high-dimensional multi-omic data by enriching graph-based models with biological sequence information. It introduces GraphSeqLM, a three-component framework that encodes DNA, RNA, and protein sequences with LLMs, fuses them with multi-omic data on a MOS-KG, and uses a GNN to predict patient outcomes. Empirical results on 826 cancer samples across 6 cancer types show that GraphSeqLM outperforms strong GNN baselines in accuracy and F1. The approach demonstrates a path toward more accurate, interpretable integration of topology, sequence-derived features, and omic signals for precision medicine.

Abstract

The integration of multi-omic data is pivotal for understanding complex diseases, but its high dimensionality and noise present significant challenges. Graph Neural Networks (GNNs) offer a robust framework for analyzing large-scale signaling pathways and protein-protein interaction networks, yet they face limitations in expressivity when capturing intricate biological relationships. To address this, we propose Graph Sequence Language Model (GraphSeqLM), a framework that enhances GNNs with biological sequence embeddings generated by Large Language Models (LLMs). These embeddings encode structural and biological properties of DNA, RNA, and proteins, augmenting GNNs with enriched features for analyzing sample-specific multi-omic data. By integrating topological, sequence-derived, and biological information, GraphSeqLM demonstrates superior predictive accuracy and outperforms existing methods, paving the way for more effective multi-omic data integration in precision medicine.
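The data flow described above (sequence embeddings fused with sample-specific omic values on a graph, then passed through a GNN to a prediction) can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the authors' implementation: the node names, the two-dimensional "sequence embeddings", the concatenation-based fusion, the single mean-aggregation layer over undirected edges, and the hand-picked classifier weights are all hypothetical stand-ins for the LLM encoders, the MOS-KG, and the trained GNN.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def fuse_features(omic: Dict[str, Vector], seq_emb: Dict[str, Vector]) -> Dict[str, Vector]:
    """Concatenate each node's omic measurements with its sequence embedding."""
    return {node: omic[node] + seq_emb[node] for node in omic}


def mean_aggregate(features: Dict[str, Vector], edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One round of mean-neighbor message passing (edges treated as undirected, self-loop included)."""
    neighbors = {node: [node] for node in features}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    updated = {}
    for node, nbrs in neighbors.items():
        dim = len(features[node])
        updated[node] = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
    return updated


def graph_readout(features: Dict[str, Vector]) -> Vector:
    """Mean-pool all node vectors into a single graph-level vector."""
    nodes = list(features)
    dim = len(features[nodes[0]])
    return [sum(features[n][i] for n in nodes) / len(nodes) for i in range(dim)]


def predict(graph_vec: Vector, weights: List[Vector], bias: Vector) -> int:
    """Linear scoring followed by argmax over classes."""
    scores = [sum(w * x for w, x in zip(row, graph_vec)) + b for row, b in zip(weights, bias)]
    return scores.index(max(scores))


# Hypothetical toy graph for one sample: gene -> transcript -> protein.
omic = {"gene_A": [0.2], "transcript_A": [1.0], "protein_A": [0.6]}
# Placeholder vectors standing in for LLM-derived sequence embeddings.
seq_emb = {"gene_A": [1.0, 0.0], "transcript_A": [0.0, 1.0], "protein_A": [1.0, 1.0]}
edges = [("gene_A", "transcript_A"), ("transcript_A", "protein_A")]

fused = fuse_features(omic, seq_emb)
hidden = mean_aggregate(fused, edges)
graph_vec = graph_readout(hidden)
# Hand-picked (untrained) weights for a two-class toy classifier.
label = predict(graph_vec, weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], bias=[0.0, 0.0])
```

In this sketch, fusion is plain concatenation and the "GNN" is a single untrained averaging step; a real model would learn the fusion and message-passing weights from labeled samples.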

Paper Structure

This paper contains 8 sections, 10 equations, 2 tables.