
Evaluating Large Language Models for Health-related Queries with Presuppositions

Navreet Kaur, Monojit Choudhury, Danish Pruthi

TL;DR

This paper introduces UPHILL, a dataset of health-related queries with varying degrees of presupposition, and finds that while model responses rarely disagree with true health claims, they often fail to challenge false ones.

Abstract

As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that these models provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT with 26%, and BingChat with 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.

Paper Structure (44 sections, 3 equations, 8 figures, 15 tables)


Figures (8)

  • Figure 1: Given a health-related claim, we pose queries to the model with increasing levels of presupposition. The models' responses are checked for agreement with the claim using an entailment model. Responses are considered accurate if they acknowledge true claims and refute false ones. We also assess if the responses are consistent.
  • Figure 2: Percentage of model responses that agree, disagree and are neutral with respect to the true, mixed and false claims in queries with increasing doses of presuppositions. A large proportion of claims (even false ones) are supported by models; the fraction increases for InstructGPT, ChatGPT and GPT-4 upon increasing presuppositions.
  • Figure 3: Percentage of responses from the BiMediX model that agree, disagree and are neutral with respect to false claims (across different presupposition levels).
  • Figure 4: Percentage of model responses that agree, disagree and are neutral with respect to fabricated claims.
  • Figure 5: Consistency of models for true, mixed and false claims, measured as the fraction of claims where stance in model responses is consistent across all levels.
  • ...and 3 more figures
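
The captions for Figures 1 and 5 define two metrics: a response is accurate if it agrees with a true claim or disagrees with a false one, and a model is consistent on a claim if its stance is the same at every presupposition level. The following is a minimal Python sketch of those two definitions, not the authors' implementation. The record layout, the function names, and the sample data are hypothetical, and the stance labels are hard-coded here, whereas the paper obtains them from an entailment model. Leaving mixed claims out of the accuracy score is also an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical records: (claim_id, veracity, presupposition_level, stance).
# In the paper the stance comes from an entailment model; here it is hard-coded.
RESPONSES = [
    ("c1", "true", 0, "agree"),
    ("c1", "true", 1, "agree"),
    ("c2", "false", 0, "disagree"),
    ("c2", "false", 1, "agree"),
    ("c3", "false", 0, "disagree"),
    ("c3", "false", 1, "disagree"),
]


def is_accurate(veracity, stance):
    """Accurate = agrees with a true claim or disagrees with a false one."""
    return (veracity == "true" and stance == "agree") or (
        veracity == "false" and stance == "disagree"
    )


def accuracy(responses):
    """Fraction of accurate responses over true and false claims.

    Assumption: claims with any other veracity (e.g. mixed) are skipped.
    """
    scored = [r for r in responses if r[1] in ("true", "false")]
    if not scored:
        return 0.0
    return sum(is_accurate(v, s) for _, v, _, s in scored) / len(scored)


def consistency(responses):
    """Fraction of claims whose stance is identical across all levels."""
    stances = defaultdict(set)
    for claim_id, _, _, stance in responses:
        stances[claim_id].add(stance)
    if not stances:
        return 0.0
    return sum(len(s) == 1 for s in stances.values()) / len(stances)


if __name__ == "__main__":
    print(f"accuracy:    {accuracy(RESPONSES):.3f}")
    print(f"consistency: {consistency(RESPONSES):.3f}")
```

In the sample data, claim c2 flips from disagree to agree as the presupposition level rises, so it lowers both scores: one of its two responses is inaccurate, and the claim as a whole counts as inconsistent.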