BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang
TL;DR
BioReason is a multimodal framework that couples a DNA foundation model encoder with a large language model (LLM) to perform interpretable, multi-step biological reasoning over genomic data. Trained with supervised fine-tuning and reinforcement learning (GRPO), it raises KEGG-based disease pathway prediction accuracy from 86% to 98% and improves variant effect prediction by an average of 15% over strong baselines, while producing step-by-step mechanistic explanations. The work also introduces a KEGG-derived reasoning dataset and reports gains on both coding and non-SNV variant tasks, underscoring the value of combining sequence embeddings with LLM reasoning. The architecture and benchmarks point toward more transparent, mechanistic AI in biology, with possible extensions to additional modalities and to large-scale applications in genomics and medicine.
Abstract
Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason
