Multi-Agent Causal Discovery Using Large Language Models

Hao Duong Le; Xin Xia; Zhang Chen

TL;DR

MAC (Multi-Agent Causal Discovery Framework) is a two-module, multi-agent LLM framework for causal discovery that jointly exploits structured data and metadata. The Debate-Coding Module (DCM) selects and executes a suitable statistical causal discovery (SCD) method on the structured data, producing an initial graph that is converted into causal metadata via Meta Fusion. The Meta-Debate Module (MDM) then refines the graph through adversarial, multi-agent debates among a Causal Affirmative Debater, a Causal Negative Debater, and a Causal Judge, guided by this causal metadata and by domain knowledge. Across five datasets, MAC outperforms traditional SCD methods and prior LLM-based approaches, and the results indicate that metadata-driven refinement improves the recovered causal graphs. The framework highlights the value of combining knowledge-driven LLM reasoning with data-driven causal discovery for scalable, interpretable causal graph learning, at the cost of some computational overhead and subject to the usual limitations of observational data.

Abstract

Causal discovery aims to identify causal relationships between variables and is a critical research area in machine learning. Traditional methods focus on statistical or machine learning algorithms to uncover causal links from structured data, often overlooking the valuable contextual information provided by metadata. Large language models (LLMs) have shown promise in creating unified causal discovery frameworks by incorporating both structured data and metadata. However, their potential in multi-agent settings remains largely unexplored. To address this gap, we introduce the Multi-Agent Causal Discovery Framework (MAC), which consists of two key modules: the Debate-Coding Module (DCM) and the Meta-Debate Module (MDM). The DCM begins with a multi-agent debating and coding process, in which agents use both structured data and metadata to collaboratively select the most suitable statistical causal discovery (SCD) method. The selected SCD method is then applied to the structured data to generate an initial causal graph. This causal graph is transformed into causal metadata through the Meta Fusion mechanism. With all the metadata, the MDM then refines the causal structure through a multi-agent debating framework. Extensive experiments across five datasets demonstrate that MAC outperforms both traditional statistical causal discovery methods and existing LLM-based approaches, achieving state-of-the-art performance.
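The flow described in the abstract (debate to pick an SCD method, run it, fuse the resulting graph into the metadata, then debate each candidate edge) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name, the prompt wording, the single debate round, the pairwise enumeration of edges, and the "yes"/"no" parsing of the judge's reply are hypothetical, and the agents are plain callables standing in for LLM calls.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]                     # (cause, effect)
Agent = Callable[[str], str]               # hypothetical stand-in for an LLM call
SCDMethod = Callable[[List[Dict[str, float]]], List[Edge]]


def debate(question: str, aff: Agent, neg: Agent, judge: Agent, rounds: int = 1) -> str:
    """Affirmative and negative agents argue in turn; the judge gives the verdict."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("AFF: " + aff(question + "\n" + "\n".join(transcript)))
        transcript.append("NEG: " + neg(question + "\n" + "\n".join(transcript)))
    return judge(question + "\n" + "\n".join(transcript))


def debate_coding_module(data, metadata: str, aff: Agent, neg: Agent, judge: Agent,
                         scd_methods: Dict[str, SCDMethod]) -> List[Edge]:
    """Debate to select an SCD method by name, then run it on the structured data."""
    question = f"{metadata}\nWhich SCD method fits? Options: {sorted(scd_methods)}"
    choice = debate(question, aff, neg, judge).strip()
    return scd_methods[choice](data)       # assumes the judge names a known method


def meta_fusion(graph: List[Edge], metadata: str) -> str:
    """Serialize the initial graph as text and append it to the metadata."""
    edges = "; ".join(f"{a} -> {b}" for a, b in graph)
    return f"{metadata}\nInitial graph: {edges}"


def meta_debate_module(variables: List[str], fused: str,
                       aff: Agent, neg: Agent, judge: Agent) -> List[Edge]:
    """Debate every ordered pair; keep the edge when the judge answers 'yes'."""
    refined: List[Edge] = []
    for a in variables:
        for b in variables:
            if a == b:
                continue
            verdict = debate(f"{fused}\nDoes {a} cause {b}?", aff, neg, judge)
            if verdict.strip().lower().startswith("yes"):
                refined.append((a, b))
    return refined


def mac(data, metadata, variables, aff, neg, judge, scd_methods) -> List[Edge]:
    initial = debate_coding_module(data, metadata, aff, neg, judge, scd_methods)
    return meta_debate_module(variables, meta_fusion(initial, metadata), aff, neg, judge)


# Deterministic stubs so the sketch runs without any LLM or causal discovery library.
def stub_judge(prompt: str) -> str:
    if "Which SCD method" in prompt:
        return "PC"
    return "yes" if "Does smoking cause cancer?" in prompt else "no"


result = mac(
    data=[{"smoking": 1.0, "cancer": 1.0}],
    metadata="Variables: smoking, cancer.",
    variables=["smoking", "cancer"],
    aff=lambda prompt: "I argue the claim holds.",
    neg=lambda prompt: "I argue the claim does not hold.",
    judge=stub_judge,
    scd_methods={"PC": lambda rows: [("smoking", "cancer")]},
)
print(result)
```

In practice the stubbed agents would wrap calls to an LLM and the entries of `scd_methods` would wrap real statistical causal discovery algorithms; the point of the sketch is only the ordering of the stages and the fact that the initial graph reaches the debaters as text.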

Paper Structure (25 sections, 6 figures, 7 tables, 3 algorithms)

Introduction
Related Works
Methodology
Problem Definition
Meta-Debate Module (MDM)
Debate-Coding Module (DCM)
MAC: Multi-Agent Causal Discovery Framework
Experiment
Experimental Setup
Implementation Details
Overall Results (R1)
Quantitative Analysis
Conclusion
Limitations
Prompting of Causal Agent
...and 10 more sections

Figures (6)

Figure 1: Meta-Debate Module: A structured meta-question-driven framework where affirmative and negative agents debate causal relationships, evaluated by a causal judge. The judge synthesizes diverse perspectives and delivers a final decision.
Figure 2: Debate-Coding Module: A two-phase module. First, agents debate to select the most suitable statistical algorithm for causal discovery, using an approach similar to that of the Meta-Debate Module. Then the causal coding executor runs the selected algorithm on the observational data.
Figure 3: Overall performance comparison between the baselines and our method on five datasets (lower SHD and NHD are better; higher F1 is better). We report the mean and standard deviation of each metric over 3 random seeds. Detailed results are given in separate tables for GPT-4o, DeepSeek-R1, and Gemini-2.0-Flash.
Figure 4: Comparison of SHD, NHD, and F1 scores across different datasets for single DCM, single MDM, and MAC.
Figure 5: Performance trends of SHD, NHD, and F1 score over five rounds across different datasets.
...and 1 more figures
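
The three metrics named in the figure captions can be illustrated on small adjacency matrices. The sketch below assumes that SHD and NHD denote the structural and normalized Hamming distances, as is common in the causal discovery literature, and uses widespread conventions (a reversed edge counts as one SHD error; NHD divides mismatched entries by the number of matrix entries; F1 is computed over directed edges). The paper's exact definitions are not given here, so these choices are assumptions, and the example graphs are made up for illustration.

```python
from typing import List

Adj = List[List[int]]  # adj[i][j] == 1 means a directed edge i -> j


def shd(true: Adj, pred: Adj) -> int:
    """Structural Hamming distance; each mismatched node pair counts once."""
    n = len(true)
    dist = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Compare the pair as a whole so a reversed edge is one error, not two.
            if (true[i][j], true[j][i]) != (pred[i][j], pred[j][i]):
                dist += 1
    return dist


def nhd(true: Adj, pred: Adj) -> float:
    """Normalized Hamming distance: mismatched entries divided by n * n."""
    n = len(true)
    mismatches = sum(true[i][j] != pred[i][j] for i in range(n) for j in range(n))
    return mismatches / (n * n)


def f1(true: Adj, pred: Adj) -> float:
    """F1 score over directed edges."""
    n = len(true)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    tp = sum(1 for i, j in pairs if true[i][j] and pred[i][j])
    fp = sum(1 for i, j in pairs if pred[i][j] and not true[i][j])
    fn = sum(1 for i, j in pairs if true[i][j] and not pred[i][j])
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the true graph is 0 -> 1 -> 2; the prediction reverses 1 -> 2.
true_graph = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
pred_graph = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
print(shd(true_graph, pred_graph), nhd(true_graph, pred_graph), f1(true_graph, pred_graph))
```

For this example the single reversed edge gives an SHD of 1, an NHD of 2/9 (two mismatched entries out of nine), and an F1 of 0.5.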
