From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT

Ahmed Abdeen Hamed; Alessandro Crimi; Magdalena M. Misiak; Byung Suk Lee

TL;DR

This work addresses the risk of factual errors in biomedical knowledge generated by LLMs and proposes a verification pipeline. It generates disease-centric term associations via prompt engineering, verifies the terms against biomedical ontologies (DOID, ChEBI, SYMP, GO) and the associations against the PubMed literature, and tests consistency across ChatGPT models. Key findings show high term accuracy for diseases, drugs, and genes (roughly 88–98%) and lower accuracy for symptoms. Literature coverage is generally strong for disease–drug and disease–gene links, with positive temporal trends, while symptom associations remain weaker. The approach offers a scalable path toward reliable AI-assisted knowledge generation and integration in biomedicine, with implications for retrieval-augmented workflows and knowledge-graph construction.

Abstract

The generative capabilities of large language models (LLMs) offer opportunities for accelerating tasks but raise concerns about the authenticity of the knowledge they produce. To address these concerns, we present a computational approach that evaluates the factual accuracy of biomedical knowledge generated by an LLM. Our approach consists of two processes: generating disease-centric associations and verifying these associations using the semantic framework of biomedical ontologies. Using ChatGPT as the selected LLM, we designed prompt-engineering processes to establish linkages between diseases and related drugs, symptoms, and genes, and assessed consistency across multiple ChatGPT models (e.g., GPT-turbo and GPT-4). Experimental results demonstrate high accuracy in identifying disease terms (88%–97%), drug names (90%–91%), and genetic information (88%–98%). However, symptom term identification was notably lower (49%–61%) because the informal and verbose nature of symptom descriptions hindered effective semantic matching with the formal language of specialized ontologies. Verification of associations reveals literature coverage rates of 89%–91% for disease-drug and disease-gene pairs, while symptom-related associations exhibit lower coverage (49%–62%).
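The term-verification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`normalize`, `term_accuracy`), the exact-match rule, and the tiny hand-written label sets are all assumptions made for illustration, standing in for labels that would in practice come from ontologies such as ChEBI or SYMP. The sketch only shows why a verbose, informal symptom phrase can fail to match a formal ontology label while a drug name matches directly.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(term.lower().split())


def term_accuracy(generated_terms, ontology_labels):
    """Return the fraction of generated terms found verbatim among the labels.

    Hypothetical helper: exact matching after normalization is an assumption,
    not the matching rule used in the paper.
    """
    labels = {normalize(label) for label in ontology_labels}
    terms = [normalize(term) for term in generated_terms]
    if not terms:
        return 0.0
    matched = sum(1 for term in terms if term in labels)
    return matched / len(terms)


# Toy label sets written by hand for illustration only; they are not
# extracted from any ontology file and the numbers below are not results.
toy_drug_labels = {"metformin", "insulin", "aspirin"}
toy_symptom_labels = {"fatigue", "polyuria", "blurred vision"}

# Hypothetical model output: drug names are terse, one symptom is verbose.
generated_drugs = ["Metformin", "insulin"]
generated_symptoms = ["Fatigue", "frequent urination, especially at night"]

drug_acc = term_accuracy(generated_drugs, toy_drug_labels)
symptom_acc = term_accuracy(generated_symptoms, toy_symptom_labels)
print(f"drug term accuracy: {drug_acc:.2f}")
print(f"symptom term accuracy: {symptom_acc:.2f}")
```

In this toy run the verbose symptom phrase has no verbatim label to match, so the symptom score drops even though the phrase describes a concept the label set contains ("polyuria"). A fuller pipeline would likely need synonym lookup or semantic matching to close that gap.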
