Prompting Strategies for Enabling Large Language Models to Infer Causation from Correlation

Eleni Sgouritsa, Virginia Aglietti, Yee Whye Teh, Arnaud Doucet, Arthur Gretton, Silvia Chiappa

TL;DR

This work tackles the challenge of inferring causal structure from correlation statements using large language models. It introduces PC-SubQ, a prompting strategy that decomposes natural language causal discovery (NL-CD) into fixed steps of the PC algorithm, guided by sequential subquestions with minimal historical context. Across five LLMs and the Corr2Cause benchmark, PC-SubQ achieves higher F1 and accuracy than baseline prompting strategies and demonstrates robustness to variable renaming, paraphrasing, and natural-story inputs, while providing transparent reasoning traces. The approach achieves strong results without model fine-tuning and suggests a general, interpretable framework for algorithmic reasoning tasks in NLP-enabled causal inference.
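As a rough illustration of the algorithmic steps that the prompting strategy mirrors, the sketch below runs a simplified version of the PC algorithm over independence statements that are given up front, much as a premise lists them in words. It is not the authors' implementation: the function name `pc_skeleton_and_colliders` and the input format are assumptions made for this example, and the later orientation rules of the full PC algorithm are omitted.

```python
from itertools import combinations


def pc_skeleton_and_colliders(variables, independencies):
    """Simplified PC steps over *given* independence statements.

    `independencies` (hypothetical format) maps frozenset({x, y}) to a set of
    conditioning sets (frozensets) under which x and y are independent.
    """
    # Step 1: start from the fully connected undirected graph.
    edges = {frozenset(pair) for pair in combinations(variables, 2)}

    # Step 2: drop the edge x - y when some conditioning set separates x and y,
    # and remember one (smallest) separating set per removed edge.
    sepset = {}
    for pair, cond_sets in independencies.items():
        if cond_sets and pair in edges:
            edges.discard(pair)
            sepset[pair] = min(cond_sets, key=len)

    # Step 3: for each non-adjacent pair x, y with a common neighbour z,
    # orient x -> z <- y when z is not in the separating set of x and y.
    colliders = set()
    for x, y in combinations(variables, 2):
        pair = frozenset((x, y))
        if pair in edges:
            continue
        for z in variables:
            if z in pair:
                continue
            has_both_edges = (
                frozenset((x, z)) in edges and frozenset((y, z)) in edges
            )
            if has_both_edges and z not in sepset[pair]:
                colliders.add((x, z, y))
    return edges, colliders
```

For instance, if the only statement is that A and B are independent given the empty set, the sketch keeps the edges A - C and B - C and reports the collider A -> C <- B.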

Abstract

The reasoning abilities of Large Language Models (LLMs) are attracting increasing attention. In this work, we focus on causal reasoning and address the task of establishing causal relationships based on correlation information, a highly challenging problem on which several LLMs have shown poor performance. We introduce a prompting strategy for this problem that breaks the original task into fixed subquestions, with each subquestion corresponding to one step of a formal causal discovery algorithm, the PC algorithm. The proposed prompting strategy, PC-SubQ, guides the LLM to follow these algorithmic steps, by sequentially prompting it with one subquestion at a time, augmenting the next subquestion's prompt with the answer to the previous one(s). We evaluate our approach on an existing causal benchmark, Corr2Cause: our experiments indicate a performance improvement across five LLMs when comparing PC-SubQ to baseline prompting strategies. Results are robust to causal query perturbations, when modifying the variable names or paraphrasing the expressions.
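The sequential prompting described above can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the paper's code: `run_subquestion_chain`, the `{premise}`, `{hypothesis}`, `{a1}`, `{a2}`, ... template fields, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical names introduced for illustration.

```python
from typing import Callable, List


def run_subquestion_chain(
    premise: str,
    hypothesis: str,
    templates: List[str],
    few_shots: List[str],
    llm: Callable[[str], str],
) -> List[str]:
    """Ask fixed subquestions one at a time, feeding earlier answers forward.

    Each template may reference {premise}, {hypothesis}, and the answers to
    earlier subquestions as {a1}, {a2}, ... (hypothetical field names).
    """
    answers: List[str] = []
    for template, shots in zip(templates, few_shots):
        fields = {"premise": premise, "hypothesis": hypothesis}
        # Expose only answers already produced; a template that asks for a
        # later answer raises KeyError, which surfaces ordering mistakes.
        fields.update({f"a{i}": a for i, a in enumerate(answers, start=1)})
        # Few-shot examples for this subquestion are prepended to the prompt.
        prompt = shots + template.format(**fields)
        answers.append(llm(prompt).strip())
    return answers
```

Because each template names exactly the earlier answers it needs, every prompt carries only that context rather than the whole conversation history; the last element of the returned list plays the role of the final answer.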

Paper Structure

This paper contains 20 sections, 6 figures, 17 tables.

Figures (6)

  • Figure 1: The 8 fixed subquestions of PC-SubQ. [Premise] and [Hypothesis] are placeholders for the input Premise and Hypothesis, respectively, while [Answer to SubQi] and [Final Answer] are placeholders for the output intermediate and final answers, respectively. [...] represents some reasoning text that we expect from the LLM. The colors demonstrate how the answers to previous subquestions are passed as input to the next ones. Few-shot CoT examples are prepended to each subquestion (see Fig. 2).
  • Figure 2: Indicative few-shot examples prepended to PC-SubQ subquestions (see Fig. 1).
  • Figure 3: The five prompting strategies against which PC-SubQ is compared. [Premise] and [Hypothesis] are placeholders for the Premise and Hypothesis, respectively, while [...] represents some reasoning text that we expect from the LLM. For both few-shot strategies, only one indicative example is shown.
  • Figure 4: F1-score and accuracy metrics for a range of prompting strategies using (a) Gemini Pro 1.0, (b) Gemini Ultra 1.0, (c) PaLM 2 L, (d) GPT-3.5-turbo and (e) GPT-4-turbo. PC-SubQ outperforms all other prompting strategies as measured by our main metric (F1-score).
  • Figure 5: Left: Performance using original, refactored and paraphrased PC-SubQ prompts, showing robustness to these perturbations. Right: PC-SubQ reasoning and answers on two natural story examples. Output is correct even though natural story examples were never presented as few-shots.
  • ...and 1 more figures