LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation

Tejumade Afonja, Ivaxi Sheth, Ruta Binkyte, Waqar Hanif, Thomas Ulas, Matthias Becker, Mario Fritz

TL;DR

Because ground-truth causal graphs are unavailable, this work develops a task-based evaluation strategy: the GRNs suggested by LLMs guide causal synthetic data generation, and the resulting data are compared against the original dataset.

Abstract

Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, leveraging their learned biological knowledge alone or in combination with traditional statistical methods. We develop a task-based evaluation strategy to address the challenge of unavailable ground truth causal graphs. Specifically, we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the resulting data against the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.

Paper Structure

This paper contains 66 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of LLM4GRN. Setting 1.A combines a Human Knowledge Base (KB) with LLM inference. Setting 1.B is the baseline setting that combines the Human KB with GRNBoost2. Setting 2.A is the full LLM pipeline that combines an LLM KB with LLM inference. Setting 2.B combines the LLM KB with GRNBoost2 inference.
  • Figure 2: Overlap between GRNs proposed by different methods. The LLM shows higher self-overlap than the GRNBoost2 algorithm.
  • Figure 3: Dot plots illustrating the gene expression profiles of top marker genes across different cell types. Red represents overexpression of the marker gene in the cell type, while blue represents downregulation. The size of each bubble (dot) represents the percentage or fraction of cells expressing the gene.
  • Figure 4: Cell type proportion analysis of the GPT4-KB LGRNBoost2 and Llama-KB GRNBoost2 datasets reveals similar cell type proportions: for the same cell type, the percentages in the two datasets differ by between 0.1% and 4%.
  • Figure 5: t-SNE projections of synthetic vs. real data for different GRN graphs (Setting 1). "Control" is the projection of training and testing data, "GRNB2" is the synthetic data based on the GRNBoost2 graph, "LLM" is the synthetic data based on the LLM graph, and "RAND" is the data based on the random graph. Red dots correspond to real data points and blue dots to synthetic data points. "Hallucinated" extra blue clusters in the RAND graph are marked with red circles.
  • ...and 4 more figures
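
The overlap between GRNs mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It assumes a GRN is represented as a set of directed (TF, target) edges and that overlap is measured with the Jaccard index; the caption does not state which overlap measure the paper uses, so both choices are assumptions. The function names, method labels, and gene names below are hypothetical and carry no data from the paper.

```python
from itertools import combinations


def jaccard(edges_a, edges_b):
    """Jaccard index between two edge sets (assumed overlap measure)."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        # Two empty graphs share no edges; report zero overlap by convention.
        return 0.0
    return len(a & b) / len(union)


def pairwise_overlap(grns):
    """Jaccard overlap for every unordered pair of named GRNs."""
    return {
        (m1, m2): jaccard(grns[m1], grns[m2])
        for m1, m2 in combinations(sorted(grns), 2)
    }


# Toy, made-up graphs: each edge is a (transcription factor, target gene) pair.
toy_grns = {
    "method_a": {("TF1", "G1"), ("TF1", "G2"), ("TF2", "G3")},
    "method_b": {("TF1", "G1"), ("TF2", "G3"), ("TF2", "G4")},
}

if __name__ == "__main__":
    for pair, score in pairwise_overlap(toy_grns).items():
        print(pair, score)
```

Under the same assumptions, self-overlap for one method could be estimated by applying `pairwise_overlap` to graphs from repeated runs of that method and averaging the scores.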