Reliable and diverse evaluation of LLM medical knowledge mastery

Yuxuan Zhou, Xien Liu, Chen Ning, Xiao Zhang, Ji Wu

TL;DR

PretexEval is a novel framework that dynamically generates reliable and diverse test samples from any given medical knowledge base to evaluate LLMs. The authors use it to systematically investigate how well 12 well-known LLMs have mastered medical factual knowledge, based on two knowledge bases that are crucial for clinical diagnosis and treatment.

Abstract

Mastering medical knowledge is crucial for medical-specific LLMs. However, despite the existence of medical benchmarks like MedQA, a unified framework that fully leverages existing knowledge bases to evaluate LLMs' mastery of medical knowledge is still lacking. In this study, we propose a novel framework, PretexEval, that dynamically generates reliable and diverse test samples to evaluate LLMs for any given medical knowledge base. We observe that test samples produced directly from knowledge bases by templates or LLMs may contain factual errors and lack diversity. To address these issues, we introduce a novel schema into our evaluation framework that employs predicate equivalence transformations to produce a series of variants for any given medical knowledge point. These predicate variants are then converted into natural-language text, resulting in a series of reliable and diverse test samples that evaluate whether LLMs fully master the given medical knowledge point. We use the framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs, based on two knowledge bases that are crucial for clinical diagnosis and treatment. The evaluation results show that current LLMs still exhibit significant deficiencies in fully mastering medical knowledge, despite achieving considerable success on some well-known public benchmarks. These findings provide valuable insights for developing medical-specific LLMs, highlighting that current LLMs urgently need to strengthen their comprehensive and in-depth mastery of medical knowledge before being applied to real-world medical scenarios.
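
The predicate-to-text idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a knowledge point is stored as a binary predicate, and it uses two logically equivalent rewrites (inversion and double negation) purely as illustrative examples of an equivalence transformation. The `Predicate` class, the relation names, the inverse table, and the text templates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    relation: str
    subject: str
    obj: str
    negations: int = 0  # number of stacked negations around the predicate

# Hypothetical inverse-relation table and text templates (assumptions).
INVERSE = {"may_treat": "may_be_treated_by"}
TEMPLATES = {
    "may_treat": "{s} may treat {o}",
    "may_be_treated_by": "{s} may be treated by {o}",
}

def invert(p: Predicate) -> Predicate:
    """R(a, b) is equivalent to R_inverse(b, a)."""
    return Predicate(INVERSE[p.relation], p.obj, p.subject, p.negations)

def double_negate(p: Predicate) -> Predicate:
    """P is equivalent to not(not(P))."""
    return Predicate(p.relation, p.subject, p.obj, p.negations + 2)

def generate_variants(p: Predicate) -> list[Predicate]:
    """Return the original predicate plus equivalent variants."""
    inv = invert(p)
    return [p, inv, double_negate(p), double_negate(inv)]

def to_text(p: Predicate) -> str:
    """Render a predicate variant as a statement to be judged true or false."""
    sentence = TEMPLATES[p.relation].format(s=p.subject, o=p.obj)
    for _ in range(p.negations):
        sentence = f"It is false that [{sentence}]"
    return sentence

if __name__ == "__main__":
    point = Predicate("may_treat", "metformin", "type 2 diabetes")
    for variant in generate_variants(point):
        print(to_text(variant))
```

Because every variant is derived by an equivalence-preserving rewrite, all four statements share the truth value of the original knowledge point, which is the property that makes transformed samples reliable. In the framework described above, the final conversion to text is more varied than these fixed templates.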

Paper Structure (36 sections, 9 equations, 11 figures, 15 tables)

Figures (11)

  • Figure 1: Drawbacks of test samples produced directly by LLMs: (1) LLMs may introduce factual errors into generated samples; (2) Samples directly generated by LLMs exhibit low diversity.
  • Figure 2: Schema of the proposed Predicate-to-text evaluation method (Top) compared with directly generating test variants by LLMs (Bottom).
  • Figure 3: An overview of the proposed PretexEval framework, which dynamically generates test samples from any medical knowledge base for evaluating LLMs’ medical knowledge mastery.
  • Figure 4: Performance (joint accuracy) of 7 typical LLMs evaluated with an increasing number of expressions per knowledge point. Top: overall performance trend averaged across LLMs; bottom: detailed performance for each LLM. To eliminate the impact of the order in which samples are added, we enumerate all possible orders and average the results, so the value at $x=i$ corresponds to the expected joint accuracy evaluated with any $i$ samples.
  • Figure 5: Left: Results of the human analysis on the reliability and diversity (lexical, structural) of samples generated by different methods; Right: Text examples in different grades of diversity.
  • ...and 6 more figures
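
The order-averaging described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that joint accuracy counts a knowledge point as correct only when every sample considered for it is answered correctly. Under that assumption, averaging over all orders of adding samples equals the probability that a uniformly random subset of $i$ samples is entirely correct, which has a closed form. The function names and the toy `results` data are hypothetical.

```python
from itertools import permutations
from math import comb

def expected_joint_accuracy(results: list[list[bool]], i: int) -> float:
    """Closed form: for a point with n samples, c of them correct, a random
    i-subset is all-correct with probability C(c, i) / C(n, i)."""
    total = 0.0
    for r in results:
        n, c = len(r), sum(r)
        total += comb(c, i) / comb(n, i)  # comb(c, i) is 0 when i > c
    return total / len(results)

def brute_force_joint_accuracy(results: list[list[bool]], i: int) -> float:
    """Enumerate every order of adding samples and average the joint
    correctness of the first i samples (feasible only for small n)."""
    total = 0.0
    for r in results:
        orders = list(permutations(r))
        total += sum(all(o[:i]) for o in orders) / len(orders)
    return total / len(results)

if __name__ == "__main__":
    # Toy per-sample correctness for three knowledge points (made-up data).
    results = [
        [True, True, True],
        [True, False, True],
        [False, False, True],
    ]
    for i in range(1, 4):
        print(i, expected_joint_accuracy(results, i),
              brute_force_joint_accuracy(results, i))
```

The closed form avoids enumerating $n!$ orders per knowledge point, and the brute-force version is included only to check that the two agree on small inputs.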