Contextualizing biological perturbation experiments through language

Menghua Wu, Russell Littman, Jacob Levine, Lin Qiu, Tommaso Biancalani, David Richmond, Jan-Christian Huetter

TL;DR

This work introduces PerturbQA, a benchmark for language-grounded reasoning over perturbation experiments, motivated by the observation that prior models make little use of the semantic content of the underlying biology. It reframes perturbation outcomes as discrete, downstream-relevant tasks (differential expression, direction of change, and gene set enrichment) and grounds reasoning in domain knowledge graphs through Summer, a retrieval-augmented, chain-of-thought prompting framework. Across five Perturb-seq datasets, current methods perform poorly on PerturbQA, while Summer, an 8B/70B-parameter LLM setup with retrieval and knowledge-graph-informed prompts, matches or exceeds the prior state of the art without model fine-tuning. The authors release code and data to encourage adoption of language-based approaches in perturbation biology and outline future directions for richer, more reliable biological reasoning with LLMs. Together, PerturbQA and Summer offer a practical path toward more interpretable, knowledge-grounded perturbation analyses, with the potential to reduce experimental burden and improve downstream interpretation.

Abstract

High-content perturbation experiments allow scientists to probe biomolecular systems at unprecedented resolution, but experimental and analysis costs pose significant barriers to widespread adoption. Machine learning has the potential to guide efficient exploration of the perturbation space and extract novel insights from these data. However, current approaches neglect the semantic richness of the relevant biology, and their objectives are misaligned with downstream biological analyses. In this paper, we hypothesize that large language models (LLMs) present a natural medium for representing complex biological relationships and rationalizing experimental outcomes. We propose PerturbQA, a benchmark for structured reasoning over perturbation experiments. Unlike current benchmarks that primarily interrogate existing knowledge, PerturbQA is inspired by open problems in perturbation modeling: prediction of differential expression and direction of change for unseen perturbations, and gene set enrichment. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR), a simple, domain-informed LLM framework that matches or exceeds the current state-of-the-art. Our code and data are publicly available at https://github.com/genentech/PerturbQA.

Paper Structure

This paper contains 30 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: A) Perturb-seq experiments result in a matrix of gene expression levels, which are interpreted through discrete outcomes. B) Textually rich biological knowledge graphs can help explain these outcomes. C) Based on this premise, PerturbQA introduces three tasks: predicting differential expression and direction of change for unseen perturbations, and summarizing data-driven gene clusters into cohesive sets.
  • Figure 2: PerturbQA dataset statistics. A) Differential expression and direction of change. B) Distribution of genes per cluster (gene set enrichment), with sample annotations. C) Knowledge graph sizes. D) DE genes are more likely to interact physically, but presence of interaction is minimally predictive (Table \ref{table:dge}). There is little difference in network connectivity.
  • Figure 3: Overview of Summer. A) Knowledge graph entries are summarized per gene, both as a perturbation $p$ and as a downstream gene $g$. B) Given a new pair $(p,g)$, related pairs $(p',g')$ are sampled together with their associated experimental outcomes. C) Summaries, examples, and guiding questions are concatenated as the prompt for the LLM. The depicted prompt is edited for concision; full prompts are in Appendix \ref{sec:prompts}.
  • Figure 4: Assessing p-value calibration over single-cell datasets. We split the non-targeting controls (NTCs) randomly in half, and run the Wilcoxon test to compare the two halves. We would expect to see that the (non-adjusted) p-values are uniformly distributed between 0 and 1. Here, we see that the Wilcoxon test is slightly conservative, i.e. it leans towards reporting "non-differentially expressed."
  • Figure 5: K562 gene clusters show consistent response between biological replicates. We compute the top $k=5, 10$ significant gene clusters, sorted by adjusted p-value, for both K562 genome-wide and K562 essential. For each perturbation, we compute the percentage of shared gene clusters (normalizing by genome-wide and essential, respectively). We see that the clusters are relatively consistent across both datasets, with a high fraction of perfect overlaps.
  • ...and 1 more figure
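
The split-half calibration check described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses synthetic Gaussian values in place of real non-targeting control cells, a hand-written two-sided rank-sum test based on the normal approximation, and hypothetical helper names (`rank_sum_pvalue`, `simulate_controls`, `split_half_pvalues`, `fraction_below`).

```python
import math
import random


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        t = end - start + 1
        tie_term += t**3 - t
        start = end + 1
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (w - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_controls(n_cells, n_genes, seed=0):
    """Synthetic stand-in for control cells: a cells x genes matrix (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]


def split_half_pvalues(control, seed=0):
    """Randomly split control cells in half and test each gene between the halves."""
    rng = random.Random(seed)
    idx = list(range(len(control)))
    rng.shuffle(idx)
    half = len(idx) // 2
    a, b = idx[:half], idx[half:]
    n_genes = len(control[0])
    return [
        rank_sum_pvalue([control[i][g] for i in a], [control[i][g] for i in b])
        for g in range(n_genes)
    ]


def fraction_below(pvals, alpha):
    """Share of p-values below alpha; close to alpha if the test is calibrated."""
    return sum(p < alpha for p in pvals) / len(pvals)


if __name__ == "__main__":
    pvals = split_half_pvalues(simulate_controls(n_cells=200, n_genes=300))
    for alpha in (0.05, 0.25, 0.5):
        print(f"alpha={alpha}: fraction below = {fraction_below(pvals, alpha):.3f}")
```

Because both halves come from the same population, a calibrated test should give `fraction_below(pvals, alpha)` close to `alpha` at every threshold. A conservative test, as the caption reports for the Wilcoxon test on the single-cell datasets, would give fractions below `alpha`.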