Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation

Jinglong Gao; Xiao Ding; Bing Qin; Ting Liu

TL;DR

The paper conducts a comprehensive zero-shot evaluation of four ChatGPT versions on three causal reasoning tasks: Event Causality Identification (ECI), Causal Discovery (CD), and Causal Explanation Generation (CEG), using the ESC, CTB, MAVEN-ERE, COPA, and e-CARE datasets. It finds that ChatGPT is not a reliable causal reasoner but can generate high-quality causal explanations. ChatGPT also exhibits pronounced causal hallucination, possibly tied to RLHF and to reporting biases between causal and non-causal relationships in natural language. In-context learning and chain-of-thought prompting can unintentionally amplify this hallucination, and performance is sensitive to how the prompt words the causal concept, to event density, and to the lexical distance between events. ChatGPT captures explicit causality better than implicit causality, and close-ended prompts outperform open-ended ones. Code is available for replication. Overall, the work highlights important limitations of causal reasoning in large language models and directions for improving it beyond prompt engineering.

Abstract

Causal reasoning ability is crucial for numerous NLP applications. Despite ChatGPT's impressive emergent abilities in various NLP tasks, it is unclear how well it performs in causal reasoning. In this paper, we conduct the first comprehensive evaluation of ChatGPT's causal reasoning capabilities. Experiments show that ChatGPT is not a good causal reasoner, but a good causal explainer. In addition, ChatGPT suffers from serious causal hallucination, possibly due to the reporting biases between causal and non-causal relationships in natural language, as well as ChatGPT's upgrading processes, such as RLHF. The In-Context Learning (ICL) and Chain-of-Thought (CoT) techniques can further exacerbate such causal hallucination. Additionally, the causal reasoning ability of ChatGPT is sensitive to the words used to express the causal concept in prompts, and close-ended prompts perform better than open-ended prompts. For events in sentences, ChatGPT is better at capturing explicit causality than implicit causality, and performs better in sentences with lower event density and smaller lexical distance between events. The code is available at https://github.com/ArrogantL/ChatGPT4CausalReasoning.

Paper Structure (30 sections, 8 figures, 12 tables)

This paper contains 30 sections, 8 figures, 12 tables.

Introduction
Key takeaways
Related Work
Causal Reasoning in NLP
Evaluation of ChatGPT's Capabilities
Evaluation Settings
Datasets and Evaluation Metrics
Event Causality Identification
Causal Discovery
Causal Explanation Generation
Experiment Setting
Baselines
Experimental Results
Event Causality Identification
Causal Discovery
...and 15 more sections

Figures (8)

Figure 1: The forms of three causal reasoning tasks and the prompts we use. The content that requires ChatGPT to reply is marked in red. The multiple-choice CD task also involves samples that ask for selecting the result of the input event. For such samples, we modify the "cause" in the question to "result".
Figure 2: Prompts that express causal concepts in various ways. The content that requires ChatGPT to reply is marked in red.
Figure 3: Performance of ChatGPT on pairs of events with different lexical distances in the ESC dataset.
Figure 4: Performance of ChatGPT on sentences with different numbers of events in the ESC dataset.
Figure 5: Performance of ChatGPT on pairs of events with different types of causality in the ESC dataset.
...and 3 more figures
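
The Figure 1 caption notes that the multiple-choice CD task asks for either the cause or the result of an input event, and that the only change between the two is swapping "cause" for "result" in the question. A minimal sketch of that swap is shown below. The function name `build_cd_prompt`, the prompt wording, and the example events are hypothetical illustrations, not the exact prompts used in the paper, which appear in Figure 1 and in the linked repository.

```python
def build_cd_prompt(event: str, options: list[str], ask_for: str = "cause") -> str:
    """Build a close-ended multiple-choice prompt for a causal discovery sample.

    The wording here is a hypothetical illustration; it is not the paper's prompt.
    `ask_for` selects whether the question asks for the cause or the result.
    """
    if ask_for not in ("cause", "result"):
        raise ValueError("ask_for must be 'cause' or 'result'")

    lines = [
        f"Input event: {event}",
        # The only difference between the two sample types is this one word.
        f"Question: Which option is the {ask_for} of the input event?",
    ]
    for index, option in enumerate(options, start=1):
        lines.append(f"Option {index}: {option}")
    # Close-ended: the model is asked to pick an option, not to answer freely.
    lines.append("Answer with the option number only.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative events, written for this sketch.
    print(build_cd_prompt(
        "The ground was wet in the morning.",
        ["It rained overnight.", "The sun was shining all night."],
        ask_for="cause",
    ))
```

Keeping the template fixed and varying only the queried role makes it easier to compare cause-selection and result-selection samples under the same conditions, which matters given the reported sensitivity to prompt wording.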