BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri

TL;DR

BiomedSQL tackles the challenge of scientific reasoning in text-to-SQL for biomedical knowledge bases by introducing a large, domain-grounded benchmark (68k question/SQL query/answer triples) executed against a harmonized BigQuery KB. The study shows that even state-of-the-art LLMs struggle with implicit biomedical conventions and multi-hop queries and fall well short of the expert baseline of about $90\%$ execution accuracy (EX); a custom multi-step system (BMSQL) raises results to about $62.6\%$ EX and $83\%$ NL-answer quality. Through experiments varying prompts, interaction paradigms, and schema size, the authors identify key bottlenecks in threshold application, table selection, and multi-table reasoning, and highlight the gains from structured, domain-specific agent design. The work provides data, code, and a reproducible evaluation framework to spur development of robust, domain-aware text-to-SQL systems for biomedical discovery.

Abstract

Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
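
To make the idea of an implicit domain criterion concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes a hypothetical table `gwas` with columns `gene`, `disease`, and `p_value`, uses an in-memory SQLite database as a stand-in for BigQuery, and takes the conventional genome-wide significance threshold of 5e-8. The helper `execution_match` is also an assumption: it compares result sets as unordered sets, which may differ from how the benchmark defines execution accuracy.

```python
import sqlite3

# Conventional genome-wide significance threshold (assumed here; the
# benchmark's gold queries may encode their own criteria).
GENOME_WIDE_P = 5e-8


def run(conn, sql):
    """Execute a query and return its rows as an unordered set."""
    return set(conn.execute(sql).fetchall())


def execution_match(conn, predicted_sql, gold_sql):
    """Hypothetical check: do both queries return the same set of rows?"""
    return run(conn, predicted_sql) == run(conn, gold_sql)


# Toy data standing in for a gene-disease association table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas (gene TEXT, disease TEXT, p_value REAL)")
conn.executemany(
    "INSERT INTO gwas VALUES (?, ?, ?)",
    [
        ("GENE_A", "disease_x", 3e-12),
        ("GENE_B", "disease_x", 4e-6),
        ("GENE_C", "disease_x", 0.2),
    ],
)

# Question: "Which genes are significantly associated with disease_x?"
# The gold query applies the genome-wide threshold that the question
# leaves implicit.
gold_sql = (
    "SELECT gene FROM gwas "
    f"WHERE disease = 'disease_x' AND p_value < {GENOME_WIDE_P}"
)

# A syntactically valid translation that uses a generic 0.05 cutoff.
naive_sql = (
    "SELECT gene FROM gwas "
    "WHERE disease = 'disease_x' AND p_value < 0.05"
)

print(execution_match(conn, naive_sql, gold_sql))
print(execution_match(conn, gold_sql, gold_sql))
```

In this toy setup the naive query also returns `GENE_B`, so it executes without error yet fails the comparison against the gold query. This is the kind of gap between syntactic translation and domain reasoning that the abstract describes.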

Paper Structure

This paper contains 27 sections, 12 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Example text-to-SQL workflow used to evaluate LLM performance on BiomedSQL. Given a question and the database schema information, an LLM must generate a SQL query and use its execution results to return a natural language response.
  • Figure 2: SQL category distribution.
  • Figure 3: Distribution of performance across SQL categories for GPT-o3-mini in terms of EX (left) and RQR (right), across four prompting and interaction paradigms.
  • Figure 4: Distribution of biological reasoning query types.
  • Figure 5: Heatmaps visualizing the association between (left) EX and BioScore and (right) JAC and BioScore.