CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate

Finn G. Vamosi, Nils D. Forkert

TL;DR

The paper introduces a dual-agent debate framework for causal reasoning in which two reasoning language models hold deliberative, adversarial discussions over causal queries. Applying this approach to the CLadder dataset with the open-source models Qwen3-32B and DeepSeek-R1-Distill-Qwen-32B, the authors report accuracy gains for both models (overall accuracy rises from 78.03% to 87.45% for DeepSeek-R1 and from 84.16% to 89.41% for Qwen3), with the largest improvements on counterfactual (Rung-3) questions. They also analyze persuasion dynamics, confidence calibration, and dialogue efficiency. The results suggest that multi-agent debate can serve as a building block for improving causal inference in AI systems, and the evaluation provides a baseline for future multi-agent causal inference research. Limitations include the synthetic nature of the dataset, potential training data leakage, and evaluation restricted to a single model pair; future work is directed at scaling to more agents and at ablation studies.

Abstract

When people reason about cause and effect, they often consider many competing "what if" scenarios before deciding which explanation fits best. Analogously, advanced language models capable of causal inference can consider multiple interventions and counterfactuals to judge the validity of causal claims. Crucially, this type of reasoning is less like a single calculation and more like an internal dialogue between alternative hypotheses. In this paper, we make this dialogue explicit through a dual-agent debate framework where one model provides a structured causal inference, and the other critically examines this reasoning for logical flaws. When disagreements arise, agents attempt to persuade each other, challenging each other's logic and revising their conclusions until they converge on a mutually agreed answer. To take advantage of this deliberative process, we specifically use reasoning language models, whose strengths in both causal inference and adversarial debate remain under-explored relative to standard large language models. We evaluate our approach on the CLadder dataset, a benchmark linking natural language questions to formally defined causal graphs across all three rungs of Pearl's ladder of causation. With Qwen3 and DeepSeek-R1 as debater agents, we demonstrate that multi-agent debate improves DeepSeek-R1's overall accuracy in causal inference from 78.03% to 87.45%, with the counterfactual category specifically improving from 67.94% to 80.04% accuracy. Similarly, Qwen3's overall accuracy improves from 84.16% to 89.41%, and counterfactual questions from 71.53% to 80.35%, showing that strong models can still benefit greatly from debate with weaker agents. Our results highlight the potential of reasoning models as building blocks for multi-agent systems in causal inference, and demonstrate the importance of diverse perspectives in causal problem-solving.
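The debate procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Turn` record, the `Agent` callable signature, the `max_turns` cap, and the two stub agents are all hypothetical, and a real agent would wrap a call to a reasoning language model. The random choice of first speaker follows the Figure 1 caption; the stopping rule (stop once two consecutive turns give the same answer) is one simple reading of "converge on a mutually agreed answer".

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Turn:
    speaker: str   # name of the agent that produced this turn
    answer: str    # e.g. "yes" or "no"
    argument: str  # free-text reasoning or critique


# Hypothetical agent interface: (question, transcript so far) -> (answer, argument).
Agent = Callable[[str, List[Turn]], Tuple[str, str]]


def debate(
    question: str,
    agents: Dict[str, Agent],
    max_turns: int = 8,  # assumed cap; the source does not state one here
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[str], List[Turn]]:
    """Alternate turns between two agents until their latest answers agree."""
    if len(agents) != 2:
        raise ValueError("this sketch supports exactly two agents")
    rng = rng or random.Random()
    order = list(agents)
    rng.shuffle(order)  # the first speaker is chosen at random
    transcript: List[Turn] = []
    for turn_index in range(max_turns):
        name = order[turn_index % 2]
        answer, argument = agents[name](question, transcript)
        transcript.append(Turn(name, answer, argument))
        # Consecutive turns always come from different agents, so matching
        # answers on the last two turns means both agents currently agree.
        if len(transcript) >= 2 and transcript[-2].answer == answer:
            return answer, transcript
    return None, transcript  # no agreement within the turn budget


# Stub agents for illustration only (no language model involved).
def stubborn(question: str, transcript: List[Turn]) -> Tuple[str, str]:
    return "no", "The stated effect does not follow from the causal graph."


def persuadable(question: str, transcript: List[Turn]) -> Tuple[str, str]:
    if len(transcript) < 2:
        return "yes", "Initial reading of the query."
    # Once a rebuttal has been heard, adopt the opponent's latest answer.
    return transcript[-1].answer, "Revised after considering the critique."


if __name__ == "__main__":
    final, log = debate("Does X cause Y?", {"A": stubborn, "B": persuadable})
    print(final, len(log))
```

Returning `None` when the turn budget runs out keeps unresolved debates distinguishable from agreed answers, so a caller can decide separately how to score them.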

Paper Structure

This paper contains 20 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Example of a causal inference debate between two reasoning language models. Qwen3 is randomly selected as the first speaker and successfully persuades DeepSeek-R1 to revise its initial conclusion, converging on the correct answer (“no”).
  • Figure 2: For questions with initial disagreement, debate improves answers far more often than it worsens them.
  • Figure 3: Initial confidence of both models combined, for each Rung. The models become less confident as Rungs get more complicated, and generally are just as confident in their incorrect answers as they are in their correct answers.
  • Figure 4: Models are more likely to be persuaded to change their answer when their opponent is more confident.
  • Figure 5: When defending their answer, models do not express uncertainty after facing criticism. However, if persuaded to change their answer, they often become more confident in their opponent's answer.
  • ...and 3 more figures