CLadder: Assessing Causal Reasoning in Language Models

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schölkopf

TL;DR

This work investigates whether large language models can perform formal causal reasoning across the three rungs of Pearl's Ladder of Causation: association, intervention, and counterfactuals. It introduces CLadder, a 10K-question benchmark whose ground-truth answers and step-by-step explanations are derived by a causal inference engine and then verbalized as natural-language stories. It also proposes CausalCoT, a structured chain-of-thought prompting approach that guides a model to extract the causal graph, specify the query, and compute the estimand; CausalCoT reaches 70.40% overall accuracy, 8.37 points above vanilla GPT-4. The study further reports data-contamination controls and a fine-grained error analysis, which highlight current limitations and point toward more reliable causal reasoning in language models. The data and tooling are open-sourced for reuse.

Abstract

The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
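To make the difference between an associational and an interventional query concrete, the following is a minimal sketch on a three-variable graph with a confounder (Z → X, Z → Y, X → Y). The probabilities are invented for illustration and are not taken from CLadder, and the function names are hypothetical. The sketch only shows the kind of computation an oracle for such queries has to perform. It is not the authors' causal inference engine.

```python
# Illustrative numbers only (an assumption, not CLadder data).
# Graph: Z -> X, Z -> Y, X -> Y, all variables binary.
P_Z1 = 0.5                              # P(Z=1)
P_X1_GIVEN_Z = {0: 0.9, 1: 0.1}         # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {                       # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): 0.2, (1, 0): 0.3,
    (0, 1): 0.8, (1, 1): 0.9,
}


def p_z(z: int) -> float:
    return P_Z1 if z == 1 else 1.0 - P_Z1


def p_x_given_z(x: int, z: int) -> float:
    p = P_X1_GIVEN_Z[z]
    return p if x == 1 else 1.0 - p


def p_y1_given_x(x: int) -> float:
    """Rung 1 (association): P(Y=1 | X=x), obtained by conditioning."""
    num = sum(p_z(z) * p_x_given_z(x, z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))
    den = sum(p_z(z) * p_x_given_z(x, z) for z in (0, 1))
    return num / den


def p_y1_do_x(x: int) -> float:
    """Rung 2 (intervention): P(Y=1 | do(X=x)) via backdoor adjustment on Z."""
    return sum(p_z(z) * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


assoc_diff = p_y1_given_x(1) - p_y1_given_x(0)
ate = p_y1_do_x(1) - p_y1_do_x(0)

print(f"P(Y=1|X=1) - P(Y=1|X=0)         = {assoc_diff:+.2f}")
print(f"P(Y=1|do(X=1)) - P(Y=1|do(X=0)) = {ate:+.2f}")
```

With these made-up numbers the observed association is negative while the adjusted causal effect is positive. This sign reversal is a Simpson's-paradox pattern, and it shows why the two query types can have different answers on the same data.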

Paper Structure

This paper contains 68 sections, 4 equations, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: Example question in our CLadder dataset featuring an instance of Simpson's paradox [pearl2022comment]. We generate the following (symbolic) triple: (i) the causal query; (ii) the ground-truth answer, derived through a causal inference engine [pearl2018book]; and (iii) a step-by-step explanation. We then verbalize these questions by turning them into stories, inspired by examples from the causality literature, which can be expressed in natural language.
  • Figure 1: Statistics of our CLadder dataset v1.5.
  • Figure 2: The data-generating process of the CLadder dataset. The upper part of the figure describes the formal part of the question generation, which samples inputs for the CI Engine and derives a ground truth answer. The bottom part describes the natural language part of the question generation, i.e., its verbalization, based on multiple stories and different degrees of alignment with commonsense knowledge.
  • Figure 3: Distributions of query types in our 10K data.
  • Figure 4: Illustration of our CausalCoT prompting strategy, which designs a chain of subquestions inspired by the idea of a CI engine [pearl2018book].
  • ...and 6 more figures
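
The chain-of-subquestions idea behind CausalCoT can be sketched as a simple prompt builder. The step wording, the constant `CAUSALCOT_STEPS`, and the function `build_causalcot_prompt` are all hypothetical. They follow only the summary's description (extract the causal graph, specify the query, compute the estimand) and do not reproduce the paper's actual prompts.

```python
# Hypothetical step wording (an assumption); the paper's exact prompts
# are not reproduced here.
CAUSALCOT_STEPS = [
    "Extract the causal graph: list each variable and every directed edge.",
    "Specify the query: say whether it is associational, interventional, "
    "or counterfactual, and write it formally.",
    "Compute the estimand: derive it from the graph and the given "
    "probabilities, then evaluate it.",
]


def build_causalcot_prompt(story: str, question: str) -> str:
    """Assemble a single prompt that walks a model through the sub-questions."""
    lines = [
        story.strip(),
        "",
        f"Question: {question.strip()}",
        "",
        "Answer step by step:",
    ]
    # Number the sub-questions so the model's reasoning can be checked per step.
    lines += [f"Step {i}: {step}" for i, step in enumerate(CAUSALCOT_STEPS, start=1)]
    lines.append("Final answer:")
    return "\n".join(lines)


prompt = build_causalcot_prompt(
    story="Z affects both X and Y, and X affects Y.",
    question="Does X increase the probability of Y?",
)
print(prompt)
```

Numbering the steps keeps the intermediate outputs separable, which is what makes a fine-grained, per-step error analysis possible.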