From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT

Ahmed Abdeen Hamed, Alessandro Crimi, Magdalena M. Misiak, Byung Suk Lee

TL;DR

This work addresses the risk of factual errors in biomedical knowledge generated by LLMs and proposes a rigorous verification pipeline. It generates disease-centric term associations via prompt engineering and verifies them against biomedical ontologies (DOID, ChEBI, SYMP, GO) and against the literature via PubMed, while testing consistency across ChatGPT models. Key findings show high term accuracy for diseases, drugs, and genes (roughly 88–98%), lower accuracy for symptoms, and generally strong literature coverage for disease–drug and disease–gene links with positive temporal trends, though symptom associations remain weaker. The approach offers a scalable path toward reliable AI-assisted knowledge generation and integration in biomedicine, with implications for retrieval-augmented workflows and knowledge-graph construction.

Abstract

The generative capabilities of large language models (LLMs) offer opportunities for accelerating tasks but raise concerns about the authenticity of the knowledge they produce. To address these concerns, we present a computational approach that evaluates the factual accuracy of biomedical knowledge generated by an LLM. Our approach consists of two processes: generating disease-centric associations and verifying these associations using the semantic framework of biomedical ontologies. Using ChatGPT as the selected LLM, we designed prompt-engineering processes to establish linkages between diseases and related drugs, symptoms, and genes, and assessed consistency across multiple ChatGPT models (e.g., GPT-turbo and GPT-4). Experimental results demonstrate high accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and genetic information (88%-98%). However, accuracy for symptom terms was notably lower (49%-61%) because symptom descriptions are informal and verbose, which hindered effective semantic matching with the formal language of specialized ontologies. Verification of associations reveals literature coverage rates of 89%-91% for disease-drug and disease-gene pairs, while symptom-related associations exhibit lower coverage (49%-62%).
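
To make the notion of a literature coverage rate concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a generated association counts as covered when the two terms co-occur in at least one article. The function `coverage_rate`, the lookup `toy_count`, and the counts in `_TOY_COUNTS` are hypothetical names and placeholder values introduced only for illustration; a real pipeline would query PubMed rather than an in-memory dictionary.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]  # (disease term, associated term)


def coverage_rate(
    pairs: Iterable[Pair],
    cooccurrence_count: Callable[[str, str], int],
) -> float:
    """Return the fraction of pairs with at least one literature co-occurrence."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    verified = sum(1 for a, b in pairs if cooccurrence_count(a, b) > 0)
    return verified / len(pairs)


# Hypothetical in-memory stand-in for a literature lookup.
# The counts are placeholders, not real PubMed statistics.
_TOY_COUNTS = {
    ("hypertension", "lisinopril"): 12,
    ("hypertension", "ace"): 3,
}


def toy_count(disease: str, term: str) -> int:
    """Look up a placeholder co-occurrence count, ignoring case."""
    return _TOY_COUNTS.get((disease.lower(), term.lower()), 0)


generated_pairs = [
    ("Hypertension", "Lisinopril"),
    ("Hypertension", "ACE"),
    ("Hypertension", "feeling kind of dizzy"),  # informal symptom phrasing
]

rate = coverage_rate(generated_pairs, toy_count)
print(f"coverage: {rate:.2%}")
```

In this toy run the informally phrased symptom finds no match, which mirrors the kind of gap the abstract reports for symptom-related associations.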

Paper Structure

This paper contains 22 sections, 7 figures, 5 tables, 3 algorithms.

Figures (7)

  • Figure 1: Graphical Abstract of the Knowledge Generation and Verification Tasks Performed.
  • Figure 2: Overview of disease associations with drugs, genes, and symptoms.
  • Figure 3: The term "Hypertension", identified as DOID:10763 in the DOID ontology, includes a metadata item for synonyms. The term verification algorithm utilizes this synonyms field to semantically verify the legitimacy of terms generated by ChatGPT.
  • Figure 4: Accuracy of DOID-ChEBI, DOID-SYMP, and DOID-GO associations across various features.
  • Figure 5: Combined visualization of average co-occurrences (line chart) and literature co-occurrence statistics (bar charts). The top plot shows trends in co-occurrences over time, while the bottom plot compares verified and unverified links for different association types.
  • ...and 2 more figures
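
The synonym-based check described in the Figure 3 caption can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's term verification algorithm: it uses exact matching after case and whitespace normalization, whereas the paper describes semantic verification. The names `OntologyTerm`, `build_index`, and `verify_term` are hypothetical, and the synonym shown is a placeholder rather than a value copied from DOID.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class OntologyTerm:
    term_id: str
    label: str
    synonyms: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def build_index(terms: Iterable[OntologyTerm]) -> Dict[str, str]:
    """Map every normalized label and synonym to its ontology term ID."""
    index: Dict[str, str] = {}
    for term in terms:
        for name in [term.label, *term.synonyms]:
            # Keep the first ID seen if two terms share a name.
            index.setdefault(normalize(name), term.term_id)
    return index


def verify_term(candidate: str, index: Dict[str, str]) -> Optional[str]:
    """Return the matching term ID, or None if the candidate is not found."""
    return index.get(normalize(candidate))


# Toy ontology fragment; the synonym list is an illustrative placeholder.
toy_ontology = [
    OntologyTerm("DOID:10763", "hypertension", ["high blood pressure"]),
]
index = build_index(toy_ontology)

for generated in ["Hypertension", "  High Blood  Pressure ", "feeling dizzy"]:
    print(repr(generated), "->", verify_term(generated, index))
```

Exact lookup keeps the sketch short; verbose, informal symptom phrases would fail such a lookup, which is consistent with the lower symptom accuracy reported in the abstract.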