CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
Owen Queen, Harrison G. Zhang, James Zou
TL;DR
CGBench addresses the gap between synthetic benchmarks and real-world translational clinical genetics by evaluating eight language models on three ClinGen-derived tasks: VCI Evidence Scoring, VCI Evidence Verification, and GCI Experimental Evidence Extraction. It leverages ClinGen SOPs and ERepo annotations to test LM reasoning, evidence extraction, and explanation alignment, using prompting strategies and a novel LM-as-a-judge framework. Findings show that reasoning-enabled LMs excel at fine-grained interpretation but still struggle with precise evidence-strength judgments, while non-reasoning models can perform relatively better on high-level tasks. Explanations from LMs often diverge from expert ClinGen explanations, highlighting alignment challenges. The results illuminate current capabilities and limitations, and point to future directions in prompt design, multi-document reasoning, and human-AI collaboration for clinical genetics research.
Abstract
Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these benchmarks focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks, but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.
