BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang

TL;DR

BioReason presents a novel multimodal framework that couples a DNA foundation encoder with a large language model to achieve interpretable, multi-step biological reasoning over genomic data. Through supervised fine-tuning and reinforcement learning (GRPO), the approach delivers substantial performance gains on KEGG-based disease-pathway reasoning and variant effect prediction, while producing step-by-step mechanistic explanations. The work introduces a KEGG-derived reasoning dataset and demonstrates robust gains across coding and non-SNV variant tasks, highlighting the value of integrating sequence embeddings with LLM reasoning. The architecture and benchmarks offer a path toward more transparent and mechanistic AI in biology, with potential extensions to additional modalities and large-scale applications in genomics and medicine.

Abstract

Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason

Paper Structure

This paper contains 38 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: BioReason Architecture. Schematic representation of our novel multimodal framework that integrates a DNA foundation model with a Large Language Model.
  • Figure 2: BioReason Dataset Curation and Composition. A. Representative example of a KEGG Variant Network element from the 298 networks used in our study, illustrating the relationship between genomic variants and the corresponding disease annotation that serves as ground truth for generating mechanistic reasoning traces. B. Example of a structured question-answer pair with an accompanying multi-step reasoning trace, showing the expected logical progression from genomic variant to phenotypic outcome. C. Pipeline for data acquisition, integration, and curation across the three BioReason tasks. D. Distribution of train/test splits across the three curated datasets; 10% of the training set was used for validation. E. Distribution of disease categories within the datasets, highlighting the diversity of variants and diseases covered.
  • Figure 3: Case Study of BioReason's Output
  • Figure 4: Mean reward progression during GRPO training across different BioReason model configurations over 1000 training steps. Shaded regions represent per-step reward variance across the batch of 8 generations per query.
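The per-step quantities described in the Figure 4 caption (mean reward and reward variance over the 8 generations sampled for a query) can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names, the example reward values, the use of population variance, and the `eps` constant are all assumptions. The group-relative advantage shown is the normalization commonly associated with GRPO; whether BioReason uses exactly this form is not stated in this section.

```python
from statistics import fmean, pvariance


def group_reward_stats(rewards):
    """Return (mean, population variance) of the rewards in one group of generations."""
    return fmean(rewards), pvariance(rewards)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the group's mean and standard deviation.

    eps is a hypothetical small constant that avoids division by zero
    when every generation in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pvariance(rewards) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for the 8 generations sampled for a single query.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.5, 1.0, 0.5]

mean, var = group_reward_stats(rewards)
advantages = group_relative_advantages(rewards)

print(f"mean reward: {mean:.3f}, variance: {var:.6f}")
print("advantages:", [round(a, 3) for a in advantages])
```

Generations scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update depends only on relative quality within the group rather than on the absolute reward scale.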