Causal-Discovery Performance of ChatGPT in the context of Neuropathic Pain Diagnosis

Ruibo Tu, Chao Ma, Cheng Zhang

TL;DR

This paper probes whether ChatGPT can perform causal-discovery reasoning in a medical context (neuropathic pain) by using a benchmark and sampling positive/negative causal relations framed as true/false questions. It reports that ChatGPT achieves high precision but low recall, with results that are inconsistent across days and hindered by domain-knowledge gaps and language-specific issues. The findings emphasize the risk of treating LLM-derived causal claims as true causal discoveries and suggest potential complementary roles for LLMs to augment, rather than replace, causal discovery methods. The work points to opportunities for tighter integration of ChatGPT-like models with causal ML tools to improve reliability and utility in clinical causal inference.
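To make the "high precision but low recall" pattern concrete, the sketch below scores a batch of true/false answers against ground-truth causal relations. The function name `precision_recall` and the toy labels are illustrative assumptions, not the paper's data or evaluation code.

```python
from typing import List, Tuple


def precision_recall(labels: List[bool], answers: List[bool]) -> Tuple[float, float]:
    """Score true/false answers against ground-truth causal relations.

    labels[i]  -- True if relation i is a real causal relation in the benchmark.
    answers[i] -- True if the model answered "true" for relation i.
    """
    pairs = list(zip(labels, answers))
    tp = sum(1 for y, a in pairs if y and a)        # real relation, answered "true"
    fp = sum(1 for y, a in pairs if not y and a)    # non-relation, answered "true"
    fn = sum(1 for y, a in pairs if y and not a)    # real relation, answered "false"
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: four real relations, two non-relations.
labels = [True, True, True, True, False, False]
# A cautious responder that says "true" only once.
answers = [True, False, False, False, False, False]

precision, recall = precision_recall(labels, answers)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case the single "true" answer is correct, so precision is perfect, while three of the four real relations are missed, so recall is low. That is the same qualitative pattern the summary attributes to ChatGPT.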

Abstract

ChatGPT has demonstrated exceptional proficiency in natural language conversation; for example, it can answer a wide range of questions that no previous large language model could. We therefore push its limits and explore its ability to answer causal discovery questions, using a medical causal discovery benchmark (Tu et al. 2019).

Paper Structure (3 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Dermatome map used as a reference for this benchmark.
  • Figure 2: Example showing that ChatGPT can correctly answer the question and provide reasonable explanations.
  • Figure 3: The lower abdomen is the region through which the T12 nerve passes. From the dermatome map (Figure 1), it is easy to see that lower back, hip, and lower abdominal discomfort can all be caused by T12 radiculopathy.
  • Figure 4: Example showing that ChatGPT fails to understand the region on the body. The area around the key bone largely overlaps with the front shoulder area, especially when the patient describes the symptoms.
  • Figure 5: Example showing that ChatGPT identifies the foreign language only from time to time, so it is not very reliable in this respect.