ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci

TL;DR

ExpliCa introduces a carefully constructed dataset that jointly probes explicit causal and temporal relations between events, each signaled by an explicit connective, with crowdsourced human acceptability ratings serving as ground truth. The paper uses a dual evaluation framework, prompting-based responses and perplexity-based scoring (APS), to assess both model performance and internal competence across seven LLMs. Even the strongest models struggle to reach 0.80 accuracy; perplexity scores often reveal latent causal knowledge that prompting does not fully exploit, and results are strongly influenced by connective type, event order, and model size. The work gives a nuanced view of how explicit linguistic cues shape causal reasoning in LLMs and offers a benchmark for advancing interpretable, temporally aware reasoning.

Abstract

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
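The perplexity-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the setup: each sentence pair is scored once per candidate connective, the connective yielding the lowest perplexity counts as the model's choice, and accuracy is the share of items where that choice matches the human label. The connective strings, the log-probability values, and the function names below are hypothetical placeholders, not taken from the paper, and the paper's own APS definition may differ.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_connective(logprobs_by_connective):
    """Return the connective whose filled-in sentence has the lowest perplexity."""
    return min(
        logprobs_by_connective,
        key=lambda c: perplexity(logprobs_by_connective[c]),
    )


def perplexity_accuracy(items):
    """Fraction of items whose lowest-perplexity connective matches the gold label."""
    hits = sum(pick_connective(logprobs) == gold for logprobs, gold in items)
    return hits / len(items)


# Toy data: made-up log-probabilities for two sentence pairs (assumption, not real
# model output). The connective names are illustrative placeholders.
items = [
    ({"so": [-1.0, -0.5, -0.7], "because": [-1.2, -1.5, -1.1],
      "then": [-1.0, -0.9, -0.8], "after": [-2.0, -1.8, -1.9]}, "so"),
    ({"so": [-1.4, -1.3], "because": [-0.4, -0.6],
      "then": [-0.3, -0.2], "after": [-1.0, -1.1]}, "because"),
]

accuracy = perplexity_accuracy(items)
print(f"perplexity-based accuracy: {accuracy:.2f}")
```

In the second toy item the temporal connective gets the lowest perplexity although the gold label is causal, which mirrors the kind of temporal/causal confusion the abstract reports. In practice the per-token log-probabilities would come from a language model's scoring of each candidate sentence.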

Paper Structure

This paper contains 30 sections, 14 figures, 14 tables.

Figures (14)

  • Figure 1: An overview of the contributions of this paper. On the left, the ExpliCa dataset, annotated with human acceptability ratings. On the right, our evaluation framework which leverages LLMs through PPL and prompting.
  • Figure 2: Number of sentence pairs (categorized by relation type and order) in each frequency bin.
  • Figure 3: Average model accuracy (on prompting tasks and perplexity) across all tasks on causally and temporally related sentence pairs. The numbers on top of each bar represent the standard deviation.
  • Figure 4: Accuracy of models by relation type and order. APS stands for Accuracy Perplexity Score.
  • Figure 5: Distribution of acceptability ratings for humans, Gpt4o, and Gemma, and of Falcon's normalized perplexity.
  • ...and 9 more figures