
ChatGPT and post-test probability

Samuel J. Weisenthal

TL;DR

This paper interrogates whether ChatGPT can perform formal probabilistic medical diagnostic reasoning using Bayes' rule, comparing prompts that use pure probability notation to those employing medical terminology. It systematically tallies errors across 20 replicates for each prompting style and shows that introducing medical variable names increases error rates, especially when conditioning on covariates like cough. Despite prompt engineering that reduces certain errors, ChatGPT still makes nontrivial probabilistic mistakes, suggesting limits to pure rule-based reasoning in current LLMs. The results imply a need for hybrid systems that combine symbolic probabilistic solvers with language models, and they point to opportunities for educational resources that bridge probability theory and medical diagnostics. Collectively, the work informs future directions in making LLMs more reliable for probabilistic reasoning in clinical contexts while highlighting risks and design implications for healthcare AI deployments.

Abstract

Reinforcement learning-based large language models, such as ChatGPT, are believed to have potential to aid human experts in many domains, including healthcare. There is, however, little work on ChatGPT's ability to perform a key task in healthcare: formal, probabilistic medical diagnostic reasoning. This type of reasoning is used, for example, to update a pre-test probability to a post-test probability. In this work, we probe ChatGPT's ability to perform this task. In particular, we ask ChatGPT to give examples of how to use Bayes' rule for medical diagnosis. Our prompts range from queries that use terminology from pure probability (e.g., requests for a posterior of A given B and C) to queries that use terminology from medical diagnosis (e.g., requests for a posterior probability of Covid given a test result and cough). We show that introducing medical variable names increases the number of errors that ChatGPT makes. Given our results, we also show how prompt engineering can help ChatGPT partially avoid these errors. We discuss our results in light of recent commentaries on sensitivity and specificity. We also discuss how our results might inform new research directions for large language models.
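
To make the pre-test to post-test update concrete, the following is a minimal sketch of Bayes' rule for a binary disease and a binary test. It is an illustration only, not the paper's prompts or code: the function name and the numeric values are hypothetical. To condition on a covariate such as cough as well, the same formula applies with every input conditioned on that covariate (a pre-test probability given cough, and a sensitivity and specificity given cough).

```python
def post_test_probability(pre_test, sensitivity, specificity, test_positive=True):
    """Bayes' rule for a binary disease D and a binary test result T.

    pre_test    = P(D)
    sensitivity = P(T positive | D)
    specificity = P(T negative | not D)
    Returns P(D | observed test result).
    """
    if test_positive:
        p_result_given_d = sensitivity
        p_result_given_not_d = 1.0 - specificity
    else:
        p_result_given_d = 1.0 - sensitivity
        p_result_given_not_d = specificity

    numerator = p_result_given_d * pre_test
    # Law of total probability: P(result) summed over D and not D.
    evidence = numerator + p_result_given_not_d * (1.0 - pre_test)
    return numerator / evidence


# Hypothetical inputs chosen for illustration, not taken from the paper.
after_positive = post_test_probability(0.10, 0.90, 0.95, test_positive=True)
after_negative = post_test_probability(0.10, 0.90, 0.95, test_positive=False)
print(f"post-test probability after a positive test: {after_positive:.3f}")
print(f"post-test probability after a negative test: {after_negative:.3f}")
```

With these assumed inputs, a positive result raises the probability of disease from 0.10 to about 0.667, and a negative result lowers it to about 0.012.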

Paper Structure (26 sections, 17 equations, 4 tables)