CARE: Turning LLMs Into Causal Reasoning Expert
Juncheng Dong, Yiling Liu, Ahmed Aloui, Vahid Tarokh, David Carlson
TL;DR
The paper investigates the limits of prompting LLMs for causal discovery, revealing heavy reliance on variable-name semantics and weak data-driven inference. It introduces CARE, a supervised fine-tuning framework that teaches LLMs to synthesize their internal world knowledge with structured outputs from classical causal-discovery algorithms, guided by diverse data augmentations. CARE leverages parameter-efficient fine-tuning and an automatic LLM-as-judge evaluation to demonstrate state-of-the-art performance on benchmark networks, including scenarios with permuted names and partial observations. The work shows that a compact LLM, when properly trained, can outperform traditional causal discovery methods and much larger models, highlighting a scalable path to robust causal reasoning in AI systems.
Abstract
Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, prior studies have shown that LLMs lack the ability to identify causal relationships, a cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs' behavior when asked to perform a causal-discovery task and find that they rely mostly on the semantic meaning of variable names while ignoring the observational data. This is unsurprising, given that LLMs were never trained to process structured datasets. As a first attempt to address this challenge, we prompt the LLMs with the outputs of established causal-discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as sufficient statistics of the observational data. Surprisingly, however, we find that prompting the LLMs with these sufficient statistics decreases their causal-discovery performance. To address this limitation, we propose CARE, a framework that enhances LLMs' causal-reasoning ability by teaching them, through supervised fine-tuning, to effectively utilize the outputs of established causal-discovery algorithms. Experimental results show that a fine-tuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective use of both its own knowledge and the external algorithmic clues.
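To make the idea of prompting an LLM with algorithm outputs concrete, the following is a minimal sketch of how an estimated graph might be serialized into text, together with a name-permutation helper of the kind the permuted-name scenario suggests. Everything here is an illustrative assumption: the function names (`edges_from_adjacency`, `build_prompt`, `permute_names`), the prompt wording, and the hand-written adjacency matrix are hypothetical and are not the paper's actual template or pipeline.

```python
import random


def edges_from_adjacency(adj, names):
    """Read directed edges from an adjacency matrix.

    Assumed convention: a nonzero adj[i][j] means names[i] -> names[j].
    """
    n = len(names)
    return [
        (names[i], names[j])
        for i in range(n)
        for j in range(n)
        if i != j and adj[i][j]
    ]


def build_prompt(names, algo_outputs):
    """Render variable names and per-algorithm edge lists as plain text.

    algo_outputs maps an algorithm label to its estimated adjacency matrix.
    The wording below is a hypothetical template.
    """
    lines = ["Variables: " + ", ".join(names)]
    for label, adj in algo_outputs.items():
        edges = edges_from_adjacency(adj, names)
        rendered = "; ".join(f"{a} -> {b}" for a, b in edges) or "(no edges)"
        lines.append(f"Edges proposed by {label}: {rendered}")
    lines.append("List the directed causal edges you believe are correct.")
    return "\n".join(lines)


def permute_names(names, seed=0):
    """Shuffle which name labels which variable (hypothetical augmentation)."""
    return random.Random(seed).sample(names, len(names))


if __name__ == "__main__":
    names = ["Smoking", "Tar", "Cancer"]
    # Hand-written matrix standing in for an algorithm's output (not computed).
    adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
    print(build_prompt(names, {"PC": adj}))
    print(build_prompt(permute_names(names, seed=1), {"PC": adj}))
```

Under permuted names the graph structure is unchanged while the labels no longer match it, so a model that relies only on name semantics would be expected to fail there, whereas one that reads the supplied edge lists need not.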
