SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction

Lawrence Phillips, Marc Boubnovski Martell, Aditya Misra, Josefa Lia Stoisser, Cesar A. Prada-Medina, Rory Donovan-Maiye, Kaspar Märtens

TL;DR

SynthPert fine-tunes LLMs on synthetic reasoning traces for cellular perturbation prediction, framed as a three-class prediction for each (cell type, perturbation, gene) triple. Chain-of-thought explanations are generated by a frontier model, filtered for quality, and used to train an 8B LLM with LoRA. The resulting model achieves state-of-the-art performance on the PerturbQA benchmark and generalizes robustly across cell types while using as little as 2% of the available data. The key finding is that structured reasoning traces, not merely raw data, drive domain-specific generalization and enable a smaller model to surpass its teacher on complex biology tasks. Synthetic reasoning distillation thus offers a practical, interpretable, and data-efficient way to improve biology-focused reasoning in LLMs.

Abstract

Predicting cellular responses to genetic perturbations is a fundamental challenge in systems biology, critical for advancing therapeutic discovery and virtual cell modeling. While large language models (LLMs) show promise for biological reasoning, their application to perturbation prediction remains underexplored because they are difficult to adapt to structured experimental data. We present SynthPert, a novel method that enhances LLM performance through supervised fine-tuning on synthetic reasoning traces generated by frontier models. Using the PerturbQA benchmark, we demonstrate that our approach not only achieves state-of-the-art performance but also surpasses the capabilities of the frontier model that generated the training data. Our results reveal three key insights: (1) synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate, (2) the approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells, and (3) performance gains persist despite using only 2% of quality-filtered training data. This work shows the effectiveness of synthetic reasoning distillation for enhancing domain-specific reasoning in LLMs.

Paper Structure

This paper contains 38 sections, 1 equation, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Illustration of the SynthPert workflow. Given experimental perturbation data in the form of "(perturbation, gene, outcome)" tuples (top left panel), our goal is to create an LLM capable of predicting responses to unseen perturbations. We consider two supervised fine-tuning (SFT) strategies: (i) a baseline where we apply SFT on the experimental data directly (bottom left panel), and (ii) SFT on synthetic chain-of-thought explanations. The arrows between panels indicate information flow. The latter strategy uses the experimental data only indirectly: a frontier LLM generates synthetic reasoning traces for the given data tuples. A separate judge LLM then evaluates their quality and keeps only those synthetic explanations that were graded "excellent". Finally, we fine-tune the base LLM on the retained chain-of-thought explanations.
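
The data-preparation half of this workflow (generate a trace per tuple, grade it, keep only the "excellent" ones) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `PerturbationExample`, `SftRecord`, `build_prompt`, and `build_sft_dataset` are hypothetical names, the prompt wording is invented, and the frontier and judge LLMs are passed in as plain callables so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PerturbationExample:
    """One experimental tuple, as described in the caption (hypothetical type)."""
    perturbation: str  # the perturbed gene
    gene: str          # the gene whose response is measured
    outcome: str       # the observed outcome label


@dataclass(frozen=True)
class SftRecord:
    """One supervised fine-tuning pair: a question and a reasoning trace."""
    prompt: str
    completion: str


def build_prompt(example: PerturbationExample) -> str:
    # Assumed prompt wording. The outcome is deliberately left out so the
    # fine-tuned model has to reason its way to the answer.
    return (
        f"Perturbation: {example.perturbation}\n"
        f"Gene: {example.gene}\n"
        "How does the expression of this gene respond?"
    )


def build_sft_dataset(
    examples: Iterable[PerturbationExample],
    generate_trace: Callable[[PerturbationExample], str],
    judge_trace: Callable[[PerturbationExample, str], str],
    keep_grade: str = "excellent",
) -> List[SftRecord]:
    """Keep only the traces that the judge grades as `keep_grade`.

    `generate_trace` stands in for the frontier LLM, which sees the full
    tuple (including the outcome). `judge_trace` stands in for the separate
    judge LLM and returns a grade string.
    """
    records = []
    for example in examples:
        trace = generate_trace(example)
        if judge_trace(example, trace) == keep_grade:
            records.append(SftRecord(build_prompt(example), trace))
    return records
```

The returned records would then be passed to whatever SFT tooling is in use. Keeping the two models behind callables also makes the filter easy to exercise with stubs before spending any frontier-model calls.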