Is ChatGPT the Future of Causal Text Mining? A Comprehensive Evaluation and Analysis
Takehiro Takayanagi, Masahiro Suzuki, Ryotaro Kobayashi, Hiroki Sakaji, Kiyoshi Izumi
TL;DR
This paper assesses ChatGPT's effectiveness for causal text mining (CTM) using a broad benchmark that spans general English, finance, and Japanese data. It defines two CTM tasks (causal sequence classification and span detection) and an evaluation framework that enables fair comparisons with encoder-based baselines. The results show that encoder models (e.g., DeBERTaV3) typically outperform ChatGPT when training data are available, while ChatGPT can be competitive in zero-shot settings. GPT-4, however, more often labels non-causal sequences as causal (causal hallucination), and the models struggle with complex causality and with domain adaptation. The work provides a public benchmark, standardized prompts, and an analysis of the limitations and open challenges of CTM with LLMs, intended to guide work on more robust, domain-aware methods.
Abstract
Causality is fundamental in human cognition and has drawn attention in diverse research fields. With growing volumes of textual data, discerning causalities within text data is crucial, and causal text mining plays a pivotal role in extracting meaningful patterns. This study conducts comprehensive evaluations of ChatGPT's causal text mining capabilities. First, we introduce a benchmark that extends beyond general English datasets to include domain-specific and non-English datasets. Second, we provide an evaluation framework to ensure fair comparisons between ChatGPT and previous approaches. Finally, our analysis outlines the limitations and future challenges in employing ChatGPT for causal text mining. Specifically, ChatGPT serves as a good starting point for various datasets; however, when a sufficient amount of training data is available, previous models still surpass ChatGPT's performance. ChatGPT also tends to falsely recognize non-causal sequences as causal sequences. These issues become even more pronounced with advanced versions of the model, such as GPT-4. In addition, we highlight the constraints of ChatGPT in handling complex causality types, including intra/inter-sentential and implicit causality. The model also faces challenges in effectively leveraging in-context learning and domain adaptation. We release our code to support further research and development in this field.
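
As an illustration of how the tendency to label non-causal sequences as causal could be quantified, the following is a minimal sketch for the causal sequence classification task. It assumes binary labels (1 = causal, 0 = non-causal) and standard precision, recall, F1, and false positive rate; the abstract does not state which metrics the paper uses, and the function name `score_causal_classification` and the toy labels are hypothetical, not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryScores:
    precision: float
    recall: float
    f1: float
    false_positive_rate: float


def score_causal_classification(gold: Sequence[int], pred: Sequence[int]) -> BinaryScores:
    """Score binary causal/non-causal predictions (1 = causal, 0 = non-causal).

    Hypothetical helper for illustration only; not the paper's evaluation code.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Confusion-matrix counts with "causal" as the positive class.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # Share of truly non-causal sequences that were predicted as causal.
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return BinaryScores(precision, recall, f1, false_positive_rate)


# Toy labels, invented for illustration (not results from the paper).
gold_labels = [1, 1, 0, 0, 0, 1]
predicted_labels = [1, 1, 1, 1, 0, 1]
scores = score_causal_classification(gold_labels, predicted_labels)
print(scores)
```

In this toy case recall is perfect while precision drops and the false positive rate is high, which is the pattern one would expect from a model that over-predicts causality.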
