CARE: Turning LLMs Into Causal Reasoning Expert

Juncheng Dong, Yiling Liu, Ahmed Aloui, Vahid Tarokh, David Carlson

TL;DR

The paper investigates the limits of prompting LLMs for causal discovery, revealing heavy reliance on variable-name semantics and weak data-driven inference. It introduces CARE, a supervised fine-tuning framework that teaches LLMs to synthesize their internal world knowledge with structured outputs from classical causal-discovery algorithms, guided by diverse data augmentations. CARE leverages parameter-efficient fine-tuning and an automatic LLM-as-judge evaluation to demonstrate state-of-the-art performance on benchmark networks, including scenarios with permuted names and partial observations. The work shows that a compact LLM, when properly trained, can outperform traditional causal discovery methods and much larger models, highlighting a scalable path to robust causal reasoning in AI systems.

Abstract

Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, research studies have shown that LLMs lack the ability to identify causal relationships, a cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs' behavior when asked to perform a causal-discovery task and find that they mostly rely on the semantic meaning of variable names, ignoring the observational data. This is unsurprising, given that LLMs were never trained to process structural datasets. As a first attempt to tackle this challenge, we prompt the LLMs with the outputs of established causal-discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as sufficient statistics of the observational data. However, quite surprisingly, we find that prompting the LLMs with these sufficient statistics decreases the LLMs' performance in causal discovery. To address this limitation, we propose CARE, a framework that enhances LLMs' causal-reasoning ability by teaching them to effectively utilize the outputs of established causal-discovery algorithms through supervised fine-tuning. Experimental results show that a fine-tuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective utilization of its own knowledge and the external algorithmic clues.
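The step of prompting an LLM with the output of a causal-discovery algorithm can be pictured with a minimal sketch. The function name `build_prompt`, the prompt wording, and the edge-list representation are all assumptions made for illustration; the paper's actual prompt template is not given in this summary.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (cause, effect); a hypothetical representation


def build_prompt(variables: List[str], algo_name: str, edges: List[Edge]) -> str:
    """Format variable names and one algorithm's proposed edges as a text prompt.

    The wording below is illustrative only, not the template used by CARE.
    """
    lines = [
        "Variables: " + ", ".join(variables),
        f"Edges proposed by {algo_name}:",
    ]
    # One line per directed edge proposed by the algorithm.
    lines += [f"  {cause} -> {effect}" for cause, effect in edges]
    lines.append("List the direct causal relationships among these variables.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy input: two variables and a single edge from a hypothetical algorithm run.
    print(build_prompt(["smoking", "cancer"], "PC", [("smoking", "cancer")]))
```

A prompt of this kind places the model's prior knowledge (the variable names) next to data-driven evidence (the proposed edges). According to the abstract, supplying such evidence by prompting alone lowered performance, which is what motivates the fine-tuning step.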

Paper Structure

This paper contains 33 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Synergizing LLM Knowledge with Strengths of Causal Discovery Algorithms. (a) Standalone approaches often falter: LLMs can misinterpret data due to over-reliance on pre-trained knowledge, while traditional Causal Discovery (CD) algorithms may struggle with insufficient data or without strong guiding priors, both potentially leading to incorrect causal graphs. (b) CARE uses supervised fine-tuning to synergistically integrate the extensive world knowledge of LLMs with the data-driven evidence from CD algorithm outputs, aiming to produce more accurate causal discoveries.
  • Figure 2: CARE Framework Overview. CARE processes causal discovery datasets, augments them to create diverse training scenarios (left), and uses these to construct prompt/answer pairs (middle-left). These pairs are then used for supervised fine-tuning (middle-right), enabling LLMs to learn and output accurate causal relationships among variables (right).
  • Figure 3: Purposes of Various Augmentation Methods. Each row illustrates a potential LLM bias (left), the corresponding augmentation operation designed to address it (middle), and the intended learning objective for the model (right).
  • Figure 4: Ground-truth DAG for the ASIA network (8 nodes) [lauritzen1988local].
  • Figure 5: Ground-truth DAG for the SURVEY network (6 nodes) [scutari2021bayesian].
  • ...and 2 more figures
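
The summary mentions scenarios with permuted names and partial observations, and Figures 2 and 3 describe augmentations that create diverse training scenarios. The sketch below shows one plausible way to produce such variants from a ground-truth edge list. The function names `permute_names` and `drop_variables` are hypothetical, and keeping only edges between retained variables is a simplifying assumption; the paper's exact augmentation procedure is not described in this summary.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (cause, effect); a hypothetical representation


def permute_names(
    variables: List[str], edges: List[Edge], rng: random.Random
) -> Tuple[List[str], List[Edge], Dict[str, str]]:
    """Reassign variable names at random so names no longer hint at the graph."""
    shuffled = variables[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(variables, shuffled))
    new_vars = [mapping[v] for v in variables]
    # The graph structure is unchanged; only the labels move.
    new_edges = [(mapping[a], mapping[b]) for a, b in edges]
    return new_vars, new_edges, mapping


def drop_variables(
    variables: List[str], edges: List[Edge], dropped: List[str]
) -> Tuple[List[str], List[Edge]]:
    """Simulate partial observation by removing some variables.

    Assumption: only edges with both endpoints retained are kept. Indirect
    paths through dropped variables are ignored in this sketch.
    """
    hidden = set(dropped)
    kept = [v for v in variables if v not in hidden]
    sub_edges = [(a, b) for a, b in edges if a not in hidden and b not in hidden]
    return kept, sub_edges


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    names = ["a", "b", "c"]
    graph = [("a", "b"), ("b", "c")]
    print(permute_names(names, graph, rng))
    print(drop_variables(names, graph, ["c"]))
```

Name permutation makes the variable names uninformative about the graph, so a model trained on such examples has to rely on the algorithmic evidence rather than on name semantics alone.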