Understanding Causality with Large Language Models: Feasibility and Opportunities

Cheng Zhang, Stefan Bauer, Paul Bennett, Jiangfeng Gao, Wenbo Gong, Agrin Hilmkil, Joel Jennings, Chao Ma, Tom Minka, Nick Pawlowski, James Vaughan

TL;DR

The paper evaluates how well large language models can answer causal questions, identifying strong performance on knowledge-based causal inquiries but significant gaps in discovering new causal relationships and making high-precision, high-stakes decisions. It argues that bridging these gaps requires integrating causal machine learning with LLMs, either through modular causal components or a new causality-aware training paradigm. The proposed directions aim to improve trust, efficiency, and applicability of LLMs in real-world causal reasoning across domains. If realized, these approaches could substantially broaden the impact of LLMs on science, industry, and decision making.

Abstract

We assess the ability of large language models (LLMs) to answer causal questions by analyzing their strengths and weaknesses against three types of causal question. We believe that current LLMs can answer causal questions with existing causal knowledge as combined domain experts. However, they are not yet able to provide satisfactory answers for discovering new knowledge or for high-stakes decision-making tasks with high precision. We discuss possible future directions and opportunities, such as enabling explicit and implicit causal modules as well as deep causal-aware LLMs. These will not only enable LLMs to answer many different types of causal questions for greater impact but also enable LLMs to be more trustworthy and efficient in general.

Paper Structure (16 sections, 10 figures)


Figures (10)

  • Figure 1: An example of a good answer to a Type 1 question. The answer is correct and comes with clear explanations. The model can clearly identify from domain knowledge whether a causal relationship exists.
  • Figure 2: In addition to the example in Figure 1, we tested the model with different names for the same region, for example "muscles close to the upper front collarbone area" instead of "shoulder area". We observe that the performance is quite stable.
  • Figure 3: This example shows that LLMs can understand the consequence of an action, which is a basic causal inference task, in a domain with common knowledge.
  • Figure 4: GPT4 cannot answer this type of question. With more assumptions given, it can only explain the meaning of these assumptions without answering the causal question.
  • Figure 5: GPT4 can identify that such a question requires causal discovery methods and tries to recommend a causal discovery toolbox. This is a great step, as it opens the possibility of answering such questions through API access and taking advantage of advances in causal machine learning research. However, the recommended methods are not suitable. For the first question, I had already stated the assumption that the relationships are linear, yet the LLM recommends two non-linear methods, one of which is designed for time series. For the second question, the recommended method is acceptable in theory but is not the most efficient one for data at this scale.
  • ...and 5 more figures