Evaluating Interventional Reasoning Capabilities of Large Language Models

Tejas Kasetty; Divyat Mahajan; Gintare Karolina Dziugaite; Alexandre Drouin; Dhanya Sridhar

TL;DR

This work investigates how large language models update their beliefs about causal relationships after interventions. It introduces intervention effects (IE), a zero-shot, prompt-based binary task defined over three canonical DAGs (bivariate, confounding, mediation), to separate causal-reasoning capabilities from memorization and surface-text shortcuts. Across three benchmarks (Random Char, Tübingen, Random Tübingen) and six models, the study finds that GPT-4 family models, especially GPT-4-turbo, demonstrate strong interventional reasoning, while LLaMA variants lag and memorization plays a limited role. The findings advance understanding of abstract causal reasoning in LLMs and point to future work on broader graphs, causal-identification tasks, and methods to improve non-GPT models.

Abstract

Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to separate the ability of LLMs to accurately predict changes resulting from an intervention from their ability to memorize facts or find other shortcuts. We evaluate six LLMs on the benchmarks, finding that GPT models show promising accuracy at predicting the intervention effects.
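To make the binary task concrete, the following is a minimal sketch of how an intervention effect could be computed on a small causal DAG. It rests on assumptions that are not taken from the paper: a causal relation is read here as "a directed path exists", an intervention is modeled as removing every edge pointing into the intervened variable, and the edge sets in `DAGS` are illustrative versions of the bivariate, confounding, and mediation graphs. The names `intervene`, `causes`, and `intervention_effect` are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)

# Illustrative edge sets (assumed, not necessarily the paper's exact graphs).
DAGS: Dict[str, Set[Edge]] = {
    "bivariate": {("A", "B")},
    "confounding": {("C", "A"), ("C", "B")},
    "mediation": {("A", "M"), ("M", "B")},
}


def intervene(edges: Set[Edge], target: str) -> Set[Edge]:
    """Model an intervention on `target` by cutting all edges into it."""
    return {(u, v) for (u, v) in edges if v != target}


def causes(edges: Set[Edge], src: str, dst: str) -> bool:
    """Assumed reading of a causal relation: a directed path src -> ... -> dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for u, v in edges:
            if u == node and v not in seen:
                if v == dst:
                    return True
                seen.add(v)
                stack.append(v)
    return False


def intervention_effect(edges: Set[Edge], target: str, src: str, dst: str) -> bool:
    """Binary label: does the src -> dst relation change after intervening on target?"""
    before = causes(edges, src, dst)
    after = causes(intervene(edges, target), src, dst)
    return before != after


if __name__ == "__main__":
    # Intervening on the mediator M cuts A -> M, so A no longer reaches B.
    print(intervention_effect(DAGS["mediation"], "M", "A", "B"))
    # Intervening on A in the confounding graph leaves C -> B untouched.
    print(intervention_effect(DAGS["confounding"], "A", "C", "B"))
```

Under these assumptions, each benchmark item reduces to a yes/no label of this kind, which a prompt can then ask a model to predict from a verbal description of the graph and the intervention.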

Paper Structure (23 sections, 4 equations, 3 figures, 9 tables)

Introduction
Related Works
Understanding changes in Causal Relationships via Intervention Effects
Background
Intervention effects
Evaluating LLMs on Intervention Effects
Empirical Analysis
Discussion and Limitations
Experiment Setup Details
Causal DAGs
Dataset Generation
Random Char (R).
Tübingen (T).
Random Tübingen (RT).
Details regarding LLMs
...and 8 more sections

Figures (3)

Figure 1: An illustration of the mapping functions that verbalize the information for a single intervention effect estimation task into a prompt.
Figure 2: Causal DAGs. In the empirical studies, we define intervention effects based on three causal DAGs: bivariate, confounding and mediation graphs.
Figure 3: Prompt design for intervention reasoning in the substitution task. Instead of using the word intervention directly in the prompt, we illustrate the concept of interventions and define it with a random word, for instance, operation 'xyz' in the prompt template above.

Theorems & Definitions (2)

Definition 1: Causal relation
Definition 2: Intervention effect
