
Large Language Models for Variant-Centric Functional Evidence Mining

Ali Saadat, Jacques Fellay

Abstract

Functional evidence is essential for clinical interpretation of genomic variants, but identifying relevant studies and translating experimental results into structured evidence remains labor-intensive. We developed a benchmark based on ClinGen curated annotations to evaluate two large language models (LLMs), a non-reasoning model (gpt-4o-mini) and a reasoning model (o4-mini), on tasks relevant to functional evidence curation: (1) abstract screening to determine whether a study reports functional experiments directly testing specific variants, and (2) full-text evidence extraction and classification from matched variant-paper pairs, including interpretation of evidence direction and generation of evidence summaries. Starting from ClinGen variants annotated with functional evidence, we processed curator comments with an LLM to extract PubMed identifiers, evidence labels, and narrative, and retrieved titles, abstracts, and open-access PDFs to construct variant-paper pairs. In abstract screening, both models achieved high recall (0.88-0.90) with moderate specificity (0.59-0.65). For full-text evidence classification under an explicit variant-matching gate, o4-mini achieved 96% accuracy and higher specificity (0.83 vs. 0.37) while maintaining high F1 (0.98 vs. 0.96) compared with gpt-4o-mini. We also used an LLM-as-judge protocol to compare model-generated evidence summaries with expert curator comments. Finally, we developed AcmGENTIC, an end-to-end pipeline that expands variant identifiers, retrieves literature via LitVar2, filters abstracts with LLMs, acquires PDFs, performs multimodal evidence extraction, and generates evidence reports for curator review, with optional agentic parsing of figures and tables. Together, this benchmark and pipeline provide a practical framework for scaling functional evidence curation with human-in-the-loop LLM assistance.

Paper Structure

This paper contains 29 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of AcmGENTIC. Starting from a user-specified variant, AcmGENTIC performs (i) variant normalization and synonym expansion across common identifiers, (ii) literature retrieval and abstract-level screening to prioritize likely functional studies, (iii) PDF acquisition and variant matching, (iv) structured experiment extraction and PS3/BS3-aligned evidence interpretation, and (v) report generation for curator verification and downstream use.
  • Figure 2: Abstract-level screening performance for identifying variant-linked functional experiments. Bars report accuracy, precision, recall (sensitivity), F1, and specificity for gpt-4o-mini and o4-mini. Both models achieve high recall, supporting use as an abstract triage filter to surface candidate functional studies for downstream full-text review.
  • Figure 3: Variant matching outcomes by model. Most pairs are successfully matched via exact identifier detection (matched), with additional matches obtained through heuristic alignment or single-variant-study inference. A substantial fraction remains unmatched.
  • Figure 4: Full-text evidence direction performance (PS3 vs. BS3) on successfully matched examples. The reasoning model (o4-mini) substantially improves specificity while maintaining high recall.
  • Figure 5: Evidence strength classification performance (4-way) on successfully matched examples. Strength grading remains challenging for both models.
  • ...and 4 more figures
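The five AcmGENTIC stages summarized in Figure 1 can be sketched as a minimal orchestration skeleton. This is an illustrative outline only: every function name, stubbed return value, and data shape below is a hypothetical placeholder, not the authors' implementation, and the real pipeline calls external services (LitVar2, LLM APIs, PDF parsers) that are stubbed out here.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceReport:
    variant: str
    synonyms: list
    screened_pmids: list
    classifications: dict = field(default_factory=dict)

def expand_synonyms(variant):
    # (i) Variant normalization and synonym expansion across common
    # identifiers (stubbed: a real system would map HGVS, rsIDs, etc.).
    return [variant, variant.lower()]

def retrieve_literature(synonyms):
    # (ii) Literature retrieval, e.g. via LitVar2 (stubbed with fixed PMIDs).
    return ["PMID:111", "PMID:222"]

def screen_abstract(pmid):
    # (ii) LLM abstract screening: keep studies likely to report functional
    # experiments directly testing the variant (stubbed as always True).
    return True

def classify_evidence(pmid):
    # (iii)-(iv) PDF acquisition, variant matching, and structured
    # PS3/BS3-aligned evidence interpretation (stubbed output).
    return {"direction": "PS3", "strength": "supporting"}

def run_pipeline(variant):
    # (v) Assemble an evidence report for curator verification.
    syns = expand_synonyms(variant)
    pmids = [p for p in retrieve_literature(syns) if screen_abstract(p)]
    report = EvidenceReport(variant, syns, pmids)
    for p in pmids:
        report.classifications[p] = classify_evidence(p)
    return report

report = run_pipeline("NM_000546.6:c.215C>G")
print(report.screened_pmids)
```

Each stage is an independent function, mirroring the human-in-the-loop design: a curator can inspect or override the output of any stage before the final report is assembled.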