Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration

Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

TL;DR

Contextual Counterfactual Credit Assignment (C3) isolates the causal impact of individual messages by freezing the exact transcript-derived context, evaluating context-matched alternatives via fixed-continuation replay, and applying a leave-one-out (LOO) baseline.

Abstract

Cooperative multi-agent reinforcement learning (MARL) systems powered by large language models (LLMs) are frequently optimized via sparse terminal-only feedback. This shared signal entangles upstream decisions, obstructing accurate decision-level credit assignment. To address this trajectory-level diffusion, we introduce Contextual Counterfactual Credit Assignment (C3). Instead of distributing rewards across an entire episode, C3 isolates the causal impact of individual messages by freezing the exact transcript-derived context, evaluating context-matched alternatives via fixed-continuation replay, and applying a leave-one-out (LOO) baseline. This localized intervention extracts unbiased, low-variance marginal advantages for standard policy-gradient optimization. Evaluated across five mathematical and coding benchmarks under matched budgets, C3 improves terminal performance over established baselines. Mechanistic diagnostics further show that these gains are accompanied by higher credit fidelity, lower contextual variance, and stronger inter-agent causal dependence. Our code is available at https://github.com/EIT-EAST-Lab/C3.
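
To make the leave-one-out step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that K alternative messages have been sampled for one frozen context and that each has already been scored by replaying the rest of the episode. The function names `estimate_returns` and `loo_advantages`, and the `replay` callable, are hypothetical and introduced only for illustration. The baseline shown is the standard LOO form (mean return of the other K-1 alternatives); the paper's exact estimator may differ in detail.

```python
from statistics import fmean
from typing import Callable, List, Sequence


def estimate_returns(
    context: str,
    alternatives: Sequence[str],
    replay: Callable[[str, str], float],  # hypothetical: replays the episode, returns terminal score
    n_replays: int = 1,
) -> List[float]:
    """Average the terminal score of each alternative over n_replays replays
    that all start from the same frozen context."""
    return [
        fmean([replay(context, action) for _ in range(n_replays)])
        for action in alternatives
    ]


def loo_advantages(returns: Sequence[float]) -> List[float]:
    """Standard leave-one-out advantages: each alternative's return minus the
    mean return of the other alternatives sampled in the same context."""
    k = len(returns)
    if k < 2:
        raise ValueError("a leave-one-out baseline needs at least two alternatives")
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]


if __name__ == "__main__":
    # Toy evaluator (an assumption for the demo): score 1.0 if the message contains "42".
    def toy_replay(context: str, action: str) -> float:
        return 1.0 if "42" in action else 0.0

    candidates = ["answer: 42", "answer: 41", "answer: 40", "the result is 42"]
    scores = estimate_returns("What is 6 * 7?", candidates, toy_replay)
    print(loo_advantages(scores))
```

Because the baseline for each alternative is computed without that alternative's own return, it does not depend on the action being scored. This is the usual argument for why a LOO baseline reduces variance without biasing the policy gradient.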

Paper Structure

This paper contains 55 sections, 35 equations, 6 figures, 17 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Left: A collaborative episode formulated as an acyclic decision event graph. Each node conditions on a deterministic transcript-derived context and emits a single textual macro-action. An external evaluator issues a terminal score upon episode completion. Right: The operational pipeline of C3. Context instances are constructed from recorded replay states. Context-matched alternative actions are sampled from a frozen behavior policy $\pi_b$. Fixed-continuation replays under $\mathcal{D}_b$ estimate expected counterfactual returns. A LOO baseline across identical contexts extracts low-variance advantages for policy-gradient optimization.
  • Figure 2: Training dynamics on Qwen3-4B-Instruct-2507, math suite. Training episodic return versus cumulative terminal evaluator calls. C3 reaches a higher plateau earlier, with lower variance across seeds.
  • Figure 3: Training efficiency Pareto plot on Qwen3-4B-Instruct-2507, math suite. Training episodic return versus cumulative training tokens. C3 reaches a favorable Pareto trade-off, converging with a fraction of the token budget used by the trajectory-level baselines.
  • Figure 4: Mechanistic validation on Qwen3 math, tracking fidelity, variance, and emergent influence.
  • Figure 5: Qwen2.5-3B-Instruct, mathematical suite. Left: training episodic return mapped against terminal scoring events. Right: training episodic return mapped against cumulative training tokens. Shaded bands depict 95% confidence intervals across five seeds. C3 achieves higher returns while using fewer training tokens.
  • ...and 1 more figure