Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness

Guido Zuccon, Bevan Koopman

TL;DR

The paper investigates how prompt-provided knowledge interacts with the model's encoded knowledge in health information seeking, using a two-condition design: question-only prompts and evidence-biased prompts. Using 100 topics from the TREC Health Misinformation track, it finds that ChatGPT is about $80\%$ accurate when relying on model knowledge alone, but that introducing external evidence through the prompt can overturn correct answers and reduce overall accuracy to about $63\%$. This highlights significant risks in retrieve-then-generate health QA pipelines and underscores the need for robust, transparent mechanisms to manage the influence of evidence. The findings inform safer prompt design and system integration for health information applications.

Abstract

Generative pre-trained language models (GPLMs) like ChatGPT encode in the model's parameters knowledge the models observe during the pre-training phase. This knowledge is then used at inference to address the task specified by the user in their prompt. For example, for the question-answering task, the GPLMs leverage the knowledge and linguistic patterns learned at training to produce an answer to a user question. Aside from the knowledge encoded in the model itself, answers produced by GPLMs can also leverage knowledge provided in the prompts. For example, a GPLM can be integrated into a retrieve-then-generate paradigm where a search engine is used to retrieve documents relevant to the question; the content of the documents is then transferred to the GPLM via the prompt. In this paper we study the differences in answer correctness generated by ChatGPT when leveraging the model's knowledge alone vs. in combination with the prompt knowledge. We study this in the context of consumers seeking health advice from the model. Aside from measuring the effectiveness of ChatGPT in this context, we show that the knowledge passed in the prompt can overturn the knowledge encoded in the model and this is, in our experiments, to the detriment of answer correctness. This work has important implications for the development of more robust and transparent question-answering systems based on generative pre-trained language models.
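
The two conditions compared in the abstract (model knowledge alone vs. model knowledge plus prompt knowledge) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `build_prompt`, `accuracy`, and `overturned` are hypothetical, the prompt wording is not the wording used in the paper, and the toy answers are invented rather than taken from the experiments.

```python
from typing import List, Optional


def build_prompt(question: str, evidence: Optional[str] = None) -> str:
    """Build a question-only prompt, or an evidence-biased prompt if a passage is given.

    The wording is hypothetical; it is not the paper's prompt template.
    """
    instruction = "Answer <Yes> or <No>, and provide an explanation afterwards."
    if evidence is None:
        # Condition 1: the model must rely on its encoded knowledge alone.
        return f"{question}\n{instruction}"
    # Condition 2: a retrieved passage is passed to the model via the prompt.
    return (
        f"{question}\n"
        f"A search returned the following passage, which you may consider:\n"
        f"{evidence}\n{instruction}"
    )


def accuracy(predictions: List[str], ground_truth: List[str]) -> float:
    """Fraction of topics whose predicted answer matches the ground truth."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def overturned(question_only: List[str], with_evidence: List[str],
               ground_truth: List[str]) -> int:
    """Count topics answered correctly without evidence but incorrectly with it."""
    return sum(
        q == g and e != g
        for q, e, g in zip(question_only, with_evidence, ground_truth)
    )


if __name__ == "__main__":
    # Invented toy answers for four topics (NOT results from the paper).
    truth = ["yes", "no", "yes", "no"]
    answers_question_only = ["yes", "no", "yes", "yes"]
    answers_with_evidence = ["no", "no", "no", "yes"]

    print(build_prompt("Does zinc help treat the common cold?"))
    print(accuracy(answers_question_only, truth))
    print(accuracy(answers_with_evidence, truth))
    print(overturned(answers_question_only, answers_with_evidence, truth))
```

On the toy data, accuracy falls from 0.75 to 0.25 and two correct answers are overturned. The same bookkeeping applied to real model outputs would give the kind of comparison the paper reports, although the actual prompts, answer parsing, and evaluation protocol are those described by the authors, not this sketch.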

Paper Structure (9 sections, 6 figures)

Figures (6)

  • Figure 1: ChatGPT prompt format for determining general effectiveness (RQ1) on TREC Misinformation topics.
  • Figure 2: Effectiveness of ChatGPT when prompting for "Yes/No" answers to TREC Misinformation questions.
  • Figure 3: ChatGPT prompt used to determine the impact of a supportive or contrary passage on answer correctness.
  • Figure 4: Effectiveness of ChatGPT when prompting with either a supporting or contrary evidence passage from TREC Misinformation qrels.
  • Figure 5:
  • ...and 1 more figure