Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation

Magdalena Wysocka, Oskar Wysocki, Maxime Delmas, Vincent Mutel, Andre Freitas

TL;DR

Factuality shows promise as an emerging property as models become domain-specialised and scale up in size and in level of human feedback.

Abstract

The paper introduces a framework for evaluating how LLMs encode factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from Large Language Models (LLMs) trained on a large corpus of scientific literature can potentially define a step change in biomedical discovery, reducing the barriers to accessing and integrating existing medical evidence. This work explores the potential of LLMs for dialoguing with biomedical background knowledge, using the context of antibiotic discovery. The framework consists of three evaluation steps that sequentially assess different aspects of the generated responses: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity. By splitting these tasks between non-experts and experts, the framework reduces the effort required from the latter. The work provides a systematic assessment of the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, in two prompting-based tasks: chemical compound definition generation and chemical compound-fungus relation determination. Although recent models have improved in fluency, factual accuracy is still low and models are biased towards over-represented entities. The ability of LLMs to serve as biomedical knowledge bases is questioned, and the need for additional systematic evaluation frameworks is highlighted. While LLMs are currently not fit for purpose as biomedical factual knowledge bases in a zero-shot setting, factuality shows promise as an emerging property as the models become domain-specialised and scale up in size and in level of human feedback.

Paper Structure (24 sections, 8 figures, 13 tables)

Figures (8)

  • Figure 1: The framework to streamline human expert evaluation of LLMs and of the encoding of factual scientific knowledge in Large Language Models, in the context of extracting biological relations. Entity 1 is the chemical compound name; entity 2 is the fungus name. Fluency, prompt alignment, and semantic coherence are assessed in STEP 1 by a non-expert (within the target domain). The domain expert then evaluates factuality in STEP 2 for Task 1. For Task 2, the factuality of the generated entity 2 (STEP 2A) is verified before the entire description is assessed (STEP 2B). Outputs that do not pass STEP 2 are classified as hallucinations. Specificity is evaluated in STEP 3.
  • Figure 2: The workflow of the performed analysis. $N$ - whole dataset with 246 selected biological relations of chemical compound-fungus pairs; $n_{relations}$ - 23 selected chemical compound-fungus pairs; $n_{entity1}$ - 10 unique chemical compounds included in those pairs. In prompt engineering: blue - entity 1, orange - added context. Qualitative evaluation is performed according to the framework depicted in Fig. 1.
  • Figure 3: The number of outputs evaluated at each STEP of the framework, showing the percentage reduction compared to the initial number of outputs evaluated in STEP 1 by non-experts. Results are aggregated for 11 models and prompts from the given task: A) Task 1, B) Task 2.
  • Figure A.1: The number of outputs evaluated at each STEP of the framework in Task 1 for simple prompts, showing the percentage of results evaluated by non-expert and expert. Results are aggregated for prompts P1-P2 from the given model.
  • Figure A.2: The number of outputs evaluated at each STEP of the framework in Task 1 for context-based prompts, showing the percentage of results evaluated by non-expert and expert. Results are aggregated for prompts P3-P4 from the given model.
  • ...and 3 more figures
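
The staged evaluation described in the Figure 1 caption, and the per-step reduction reported in Figure 3, can be illustrated with a minimal sketch. It is not the authors' implementation. The `Output` class, its boolean labels, the rule that an output must pass every STEP 1 check to reach the expert, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """One generated response with hypothetical, manually assigned labels."""
    fluent: bool          # STEP 1 (non-expert)
    prompt_aligned: bool  # STEP 1 (non-expert)
    coherent: bool        # STEP 1 (non-expert)
    factual: bool         # STEP 2 (domain expert)
    specific: bool        # STEP 3


def run_funnel(outputs):
    """Count how many outputs reach each step of the staged evaluation."""
    step1 = list(outputs)  # every output is seen by the non-expert
    # Assumption: only outputs passing all STEP 1 checks go to the expert.
    step2 = [o for o in step1 if o.fluent and o.prompt_aligned and o.coherent]
    # Outputs that fail the factuality check count as hallucinations.
    step3 = [o for o in step2 if o.factual]
    return {
        "step1": len(step1),
        "step2": len(step2),
        "step3": len(step3),
        "hallucinations": len(step2) - len(step3),
        "specific": sum(o.specific for o in step3),
    }


def reduction(counts):
    """Percentage reduction at STEP 2 and STEP 3 relative to STEP 1."""
    base = counts["step1"]
    if base == 0:
        return {"step2": 0.0, "step3": 0.0}
    return {k: 100.0 * (base - counts[k]) / base for k in ("step2", "step3")}


# Made-up sample: four outputs with illustrative labels only.
sample = [
    Output(True, True, True, True, True),
    Output(True, True, True, False, False),
    Output(False, True, True, True, True),
    Output(True, True, True, True, False),
]
counts = run_funnel(sample)
print(counts)
print(reduction(counts))
```

In this sketch the expert only sees the outputs that survive STEP 1, which is how the split between non-experts and experts reduces the expert's workload.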