ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Martina Miliani; Serena Auriemma; Alessandro Bondielli; Emmanuele Chersoni; Lucia Passaro; Irene Sucameli; Alessandro Lenci

TL;DR

ExpliCa introduces a carefully constructed dataset that jointly probes explicit causal and temporal relations between events, expressed through clearly defined connectives and grounded in crowdsourced human ratings. The paper uses a dual evaluation framework, combining prompting-based responses with perplexity-based scoring (APS), to assess both model performance and internal competence across seven LLMs. Even strong models struggle to reach high accuracy. Perplexity scores often reveal latent causal knowledge that prompting does not fully exploit, and results are strongly influenced by connective type, event order, and model size. The work shows how explicit linguistic cues shape causal reasoning in LLMs and provides a benchmark for interpretable, temporally aware reasoning.

Abstract

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
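To make the idea of a perplexity-based comparison concrete, here is a minimal sketch of how candidate connectives for the same event pair could be ranked by sentence perplexity. It is a generic illustration, not the paper's scoring procedure: the function names, the four connectives, and the token log-probabilities are all assumptions invented for the example. In practice the log-probabilities would come from a language model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_connectives(logprobs_by_connective):
    """Return (connective, perplexity) pairs, lowest perplexity first."""
    scored = {c: perplexity(lp) for c, lp in logprobs_by_connective.items()}
    return sorted(scored.items(), key=lambda kv: kv[1])


# Hypothetical per-token log-probabilities for one event pair joined by
# each connective. These numbers are made up for illustration only.
example = {
    "so": [-1.0, -0.5, -0.5],
    "because": [-2.0, -1.5, -1.0],
    "then": [-1.5, -1.0, -0.5],
    "after": [-2.5, -2.0, -1.5],
}

ranking = rank_connectives(example)
best_connective = ranking[0][0]  # the connective the model finds most plausible
```

Comparing such a ranking with the answer a model gives when prompted directly is one way to see the gap the abstract describes between perplexity-based scores and prompting performance.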
