Large Language Models for Zero-shot Inference of Causal Structures in Biology

Izzy Newsham, Luka Kovačević, Richard Moulange, Nan Rosemary Ke, Sach Mukherjee

TL;DR

This work establishes a framework to evaluate zero-shot causal inference capabilities of LLMs in biology by constructing a ground-truth causal graph from Perturb-seq data across 100 cancer-relevant genes and assessing LLM-derived ancestral graphs via pairwise prompting. It systematically investigates context-aware and retrieval-augmented prompts, finding that tailored experimental context improves causal-direction inference, with AUROC peaking around 0.625 on full graph inference. Chain-of-thought prompts and gene-specific literature context often fail to improve performance, while LLMs outperform a STRING-based knowledge baseline when used as priors for downstream causal discovery. Overall, the results support using LLMs as context-sensitive priors to guide causal structure learning in complex biological systems, underscoring a general framework for evaluating LLMs in causal learning and scientific discovery.

Abstract

Genes, proteins and other biological entities influence one another via causal molecular networks. Causal relationships in such networks are mediated by complex and diverse mechanisms, through latent variables, and are often specific to cellular context. It remains challenging to characterise such networks in practice. Here, we present a novel framework to evaluate large language models (LLMs) for zero-shot inference of causal relationships in biology. In particular, we systematically evaluate causal claims obtained from an LLM using real-world interventional data. This is done over one hundred variables and thousands of causal hypotheses. Furthermore, we consider several prompting and retrieval-augmentation strategies, including large, and potentially conflicting, collections of scientific articles. Our results show that with tailored augmentation and prompting, even relatively small LLMs can capture meaningful aspects of causal structure in biological systems. This supports the notion that LLMs could act as orchestration tools in biological discovery, by helping to distil current knowledge in ways amenable to downstream analysis. Our approach to assessing LLMs with respect to experimental data is relevant for a broad range of problems at the intersection of causal learning, LLMs and scientific discovery.

Paper Structure

This paper contains 19 sections, 2 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Directed edges are drawn between the perturbed gene $k$ and the set of genes $\Delta_k=\{i, \ldots, j\}$ that change significantly under experimental intervention on $k$.
  • Figure 2: Outputs for inferring causal direction with different prompt contexts, for the example gene pair ATR and CD47.
  • Figure 3: Results on Gemma2 for all combinations of prompt variants (different contexts along each column, different gene-specific information along each row). The results are shown as the mean AUROC over 10 repetitions, with the standard error given in brackets.
  • Figure A.1: Results on Gemma2 using simple chain of thought, compared to the results using no chain of thought (shown in Figure 3). Green indicates the simple CoT reached a higher AUROC than no CoT and pink indicates it reached a lower AUROC than no CoT.
  • Figure A.2: Results on Gemma2 using guided chain of thought, compared to the results using no chain of thought (shown in Figure 3). Green indicates the guided CoT reached a higher AUROC than no CoT and pink indicates it reached a lower AUROC than no CoT.
  • ...and 3 more figures
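
The graph construction in the Figure 1 caption and the AUROC summary in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' code. The function names, the gene labels `G1` to `G3`, and the scores are hypothetical, and the paper's actual significance test and prompting pipeline are not reproduced here. The sketch assumes that each perturbed gene $k$ comes with a precomputed set $\Delta_k$ and that the LLM yields one score per ordered gene pair.

```python
from itertools import permutations
from math import sqrt
from statistics import mean, stdev


def build_ground_truth(delta):
    """Return directed edges (k, i) for every gene i in delta[k].

    delta maps each perturbed gene k to the set of genes that changed
    significantly under intervention on k (assumed to be precomputed).
    """
    return {(k, i) for k, changed in delta.items() for i in changed if i != k}


def auroc(scores, edges, genes):
    """Rank-based AUROC over all ordered gene pairs; ties count as half."""
    pairs = list(permutations(genes, 2))
    pos = [scores[p] for p in pairs if p in edges]
    neg = [scores[p] for p in pairs if p not in edges]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_and_standard_error(values):
    """Mean and standard error of the mean across repeated runs."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical toy example with three placeholder genes.
genes = ["G1", "G2", "G3"]
delta = {"G1": {"G2", "G3"}, "G2": {"G3"}, "G3": set()}
edges = build_ground_truth(delta)

# Hypothetical LLM scores for "does the first gene causally affect the second?"
scores = {
    ("G1", "G2"): 0.9, ("G1", "G3"): 0.8, ("G2", "G3"): 0.4,
    ("G2", "G1"): 0.1, ("G3", "G1"): 0.5, ("G3", "G2"): 0.2,
}
print(auroc(scores, edges, genes))
```

Repeating the scoring step several times and passing the resulting AUROC values to `mean_and_standard_error` gives a summary in the same form as the Figure 3 caption (mean with standard error).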