Is ChatGPT the Future of Causal Text Mining? A Comprehensive Evaluation and Analysis

Takehiro Takayanagi, Masahiro Suzuki, Ryotaro Kobayashi, Hiroki Sakaji, Kiyoshi Izumi

TL;DR

This paper assesses ChatGPT's effectiveness for causal text mining (CTM) using a broad benchmark that spans general English, finance, and Japanese data. It defines two CTM tasks—causal sequence classification and span detection—and an evaluation framework that enables fair comparisons with encoder-based baselines. The results show that encoder models (e.g., DeBERTaV3) typically outperform ChatGPT when training data are available, while ChatGPT can be competitive in zero-shot settings; GPT-4, however, exhibits causal hallucinations and struggles with complex causality, especially in domain adaptation. The work provides a public benchmark, standardized prompts, and insights into the limitations and future challenges of CTM with LLMs, guiding researchers toward more robust, domain-aware methods.

Abstract

Causality is fundamental in human cognition and has drawn attention in diverse research fields. With growing volumes of textual data, discerning causalities within text data is crucial, and causal text mining plays a pivotal role in extracting meaningful patterns. This study conducts comprehensive evaluations of ChatGPT's causal text mining capabilities. Firstly, we introduce a benchmark that extends beyond general English datasets, including domain-specific and non-English datasets. We also provide an evaluation framework to ensure fair comparisons between ChatGPT and previous approaches. Finally, our analysis outlines the limitations and future challenges in employing ChatGPT for causal text mining. Specifically, our analysis reveals that ChatGPT serves as a good starting point for various datasets. However, when equipped with a sufficient amount of training data, previous models still surpass ChatGPT's performance. Additionally, ChatGPT suffers from the tendency to falsely recognize non-causal sequences as causal sequences. These issues become even more pronounced with advanced versions of the model, such as GPT-4. In addition, we highlight the constraints of ChatGPT in handling complex causality types, including both intra/inter-sentential and implicit causality. The model also faces challenges with effectively leveraging in-context learning and domain adaptation. We release our code to support further research and development in this field.

Paper Structure (15 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Illustrations of prompts for the causal sequence classification task and the causal span detection task. Regions outlined by dotted lines represent demonstrations of the few-shot setting and are omitted under the zero-shot setting. All text examples are sourced from the FinCausal 2020 dataset (Mariko et al., 2020).
  • Figure 2: Performance comparisons for inter- and intra-sentential causality in the causal sequence classification task on the ESL, PDTB, and FinCausal datasets.
  • Figure 3: Performance comparisons for different causality types in causal span detection tasks. The left two bar plots represent performance for inter- and intra-sentential causality on the PDTB and FinCausal datasets, while the right plot illustrates performance for explicit and implicit causality on the PDTB dataset.
  • Figure 4: Performance of ChatGPT on causal sequence classification across multiple datasets with varying numbers of demonstrations.
  • Figure 5: Performance of ChatGPT on causal span detection across multiple datasets with varying numbers of demonstrations.
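
The prompt layout described for Figure 1 (an instruction, optional demonstrations, then the query) can be illustrated with a minimal sketch. This is not the authors' prompt: the instruction wording, the `Text:`/`Answer:` field labels, the function name `build_classification_prompt`, and the example sentences are all assumptions made for illustration. The only property taken from the caption is that demonstrations appear in the few-shot setting and are omitted in the zero-shot setting.

```python
from typing import Sequence, Tuple

# Hypothetical instruction wording; the paper's actual prompt text is not reproduced here.
INSTRUCTION = "Does the following text contain a causal relation? Answer Yes or No."


def build_classification_prompt(
    text: str,
    demonstrations: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a causal sequence classification prompt.

    With an empty `demonstrations` sequence this is the zero-shot prompt;
    each (text, label) pair adds one few-shot demonstration before the query.
    """
    parts = [INSTRUCTION]
    for demo_text, demo_label in demonstrations:
        parts.append(f"Text: {demo_text}\nAnswer: {demo_label}")
    # The query comes last, with the answer left open for the model to fill in.
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Made-up sentences for illustration only, not drawn from any benchmark dataset.
    demos = [
        ("Shares fell because the company missed its earnings forecast.", "Yes"),
        ("The company is headquartered in Tokyo.", "No"),
    ]
    query = "Profits rose after the firm cut operating costs."
    print(build_classification_prompt(query))         # zero-shot
    print("-" * 40)
    print(build_classification_prompt(query, demos))  # two-shot
```

Keeping the demonstrations as a separate argument makes it straightforward to vary their number, which is the quantity on the horizontal axis of Figures 4 and 5.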