Causal-Discovery Performance of ChatGPT in the context of Neuropathic Pain Diagnosis
Ruibo Tu, Chao Ma, Cheng Zhang
TL;DR
This paper probes whether ChatGPT can perform causal-discovery reasoning in a medical context (neuropathic pain) by using a benchmark and sampling positive/negative causal relations framed as true/false questions. It reports that ChatGPT achieves high precision but low recall, with results that are inconsistent across days and hindered by domain-knowledge gaps and language-specific issues. The findings emphasize the risk of treating LLM-derived causal claims as true causal discoveries and suggest potential complementary roles for LLMs to augment, rather than replace, causal discovery methods. The work points to opportunities for tighter integration of ChatGPT-like models with causal ML tools to improve reliability and utility in clinical causal inference.
Abstract
ChatGPT has demonstrated exceptional proficiency in natural language conversation; for example, it can answer a wide range of questions that no previous large language model could. We therefore push its limits and explore its ability to answer causal discovery questions, using a medical causal discovery benchmark (Tu et al. 2019).
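To make the evaluation concrete, the following is a minimal sketch of how true/false answers about sampled causal relations could be scored for precision and recall. It is an illustration under stated assumptions, not the authors' code: the question wording, the `Query` type, the placeholder variable names, and the example answers are all hypothetical and do not come from the benchmark or from the reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    cause: str
    effect: str
    is_causal: bool  # ground truth: does the benchmark graph contain cause -> effect?
    answer: bool     # the model's true/false reply to the question


def as_question(cause: str, effect: str) -> str:
    # Hypothetical phrasing; the exact prompt used in the paper is not shown here.
    return f"True or false: {cause} causes {effect}."


def precision_recall(queries):
    # Count true positives, false positives, and false negatives over all queries.
    tp = sum(q.is_causal and q.answer for q in queries)
    fp = sum((not q.is_causal) and q.answer for q in queries)
    fn = sum(q.is_causal and (not q.answer) for q in queries)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up answers over placeholder variables: 4 positive and 4 negative relations.
queries = [
    Query("A", "B", True, True),
    Query("A", "C", True, False),
    Query("B", "D", True, False),
    Query("C", "D", True, False),
    Query("B", "A", False, False),
    Query("C", "A", False, False),
    Query("D", "B", False, False),
    Query("D", "C", False, False),
]

precision, recall = precision_recall(queries)
print(as_question("A", "B"))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy input the model says "true" for only one relation, and that relation is correct, so precision is 1.0 while recall is 0.25. This mirrors the qualitative pattern of high precision with low recall, but the numbers themselves are invented for illustration.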
