QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?

Zhouyang Jiang, Bin Zhang, Yuanjun Li, Zhiwei Xu

TL;DR

This work tackles credit assignment in cooperative multi-agent reinforcement learning by replacing traditional mixing networks with a Training-Free Credit Assignment Function (TFCAF) generated by Large Language Models. A novel coder-evaluator framework enables zero-shot TFCAF construction with automated error checking and selection, while an IGM-Gating mechanism provides task-adaptive control over monotonicity constraints. Empirical results across MARL benchmarks show QLLM achieves superior performance and generalizes to various mixing-based algorithms, with reduced trainable parameters and enhanced interpretability. The approach promises scalable, human-readable credit attribution in complex multi-agent systems and broad compatibility with existing CTDE frameworks.

Abstract

Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL). Previous studies have primarily addressed this issue through value decomposition methods under the centralized training with decentralized execution paradigm, where neural networks are utilized to approximate the nonlinear relationship between individual Q-values and the global Q-value. Although these approaches have achieved considerable success in various benchmark tasks, they still suffer from several limitations, including imprecise attribution of contributions, limited interpretability, and poor scalability in high-dimensional state spaces. To address these challenges, we propose a novel algorithm, QLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs). Specifically, the concept of TFCAF is introduced, wherein the credit allocation process is represented as a direct and expressive nonlinear functional formulation. A custom-designed coder-evaluator framework is further employed to guide the generation and verification of executable code by LLMs, significantly mitigating issues such as hallucination and shallow reasoning during inference. Furthermore, an IGM-Gating Mechanism enables QLLM to flexibly enforce or relax the monotonicity constraint depending on task demands, covering both IGM-compliant and non-monotonic scenarios. Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines. Moreover, QLLM exhibits strong generalization capability and maintains compatibility with a wide range of MARL algorithms that utilize mixing networks, positioning it as a promising and versatile solution for complex multi-agent scenarios. The code is available at https://github.com/zhouyangjiang71-sys/QLLM.
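
To make the idea of a training-free credit assignment function more concrete, here is a minimal sketch. It is not the code generated by QLLM: the name `tfcaf`, the state-derived `tanh` weights, and the boolean `enforce_igm` flag are all hypothetical and chosen for illustration. The abstract states only that the function is a nonlinear formulation without trainable parameters and that the monotonicity constraint can be enforced or relaxed per task.

```python
import math
from typing import Sequence


def tfcaf(q_values: Sequence[float], state: Sequence[float],
          enforce_igm: bool = True) -> float:
    """Combine local Q-values into a global Q-value with no trainable parameters.

    Hypothetical toy example: the weights are a fixed, hand-written function
    of the state, one state feature per agent.
    """
    if len(q_values) != len(state):
        raise ValueError("this toy example expects one state feature per agent")
    # Fixed nonlinear weights derived from the state; tanh keeps them in (-1, 1).
    weights = [math.tanh(s) for s in state]
    if enforce_igm:
        # Non-negative weights make the output non-decreasing in every local
        # Q-value, the sufficient condition for IGM used by QMIX-style mixers.
        weights = [abs(w) for w in weights]
    # With the gate off, negative weights are allowed (non-monotonic case).
    return sum(w * q for w, q in zip(weights, q_values))
```

With the gate on, raising any agent's local Q-value can never lower the global value; with the gate off, an agent whose state feature is negative contributes negatively, which is the kind of non-monotonic relationship the abstract says QLLM can also cover.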

Paper Structure

This paper contains 35 sections, 11 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: A comparison between the traditional value decomposition method using a mixing network (Left) and our proposed novel paradigm QLLM leveraging LLMs (Right). The traditional approach employs a neural network to model the nonlinear relationship between local Q-values and the global Q-value, whereas QLLM capitalizes on the extensive knowledge encoded within LLMs to directly generate a training-free credit assignment function.
  • Figure 2: QLLM is designed within a coder-evaluator framework to autonomously generate high-quality, training-free credit assignment functions (TFCAFs). In each iteration, the coder LLM $M_{\text{coder}}$ produces $K$ candidate functions based on a task prompt and a role-specific instruction, referred to as the coder prompt. These candidate functions are first checked by running them on environmental state information and local Q-values. If any runtime or syntax errors are identified, the faulty candidates are discarded and regenerated. Subsequently, the evaluator LLM $M_{\text{evaluator}}$ reviews the candidate functions and selects the most promising one based on its evaluator prompt. The entire process is driven by LLMs through structured prompts, without any need for human intervention.
  • Figure 3: Average episodic return curves for selected tasks in GRF and MPE environments (the results for LBF are shown in Appendix C).
  • Figure 4: Average episodic return curves in the matrix-game environment (Right) and the payoff matrix used in this environment (Left).
  • Figure 5: Compatibility evaluation of QLLM in the MPE environments. The original mixing networks in RIIT, MASER, and QMIX are replaced by the TFCAF generated by QLLM, and the resulting performance is compared against the original implementations.
  • ...and 4 more figures
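
The coder-evaluator loop described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `coder` and `evaluator` are hypothetical callables standing in for the two LLMs, the generated function is assumed to be named `credit_assignment`, and `max_retries` is an invented bound on regeneration.

```python
from typing import Callable, List, Sequence

CoderFn = Callable[[str], str]            # prompt -> Python source (hypothetical)
EvaluatorFn = Callable[[List[str]], int]  # candidates -> index of the chosen one


def runs_without_error(source: str, q_values: Sequence[float],
                       state: Sequence[float]) -> bool:
    """Compile a candidate and call it once on sample inputs."""
    namespace: dict = {}
    try:
        # Generated code is untrusted; a real system should sandbox this call.
        exec(source, namespace)
        result = namespace["credit_assignment"](q_values, state)
        float(result)  # the output must be usable as a scalar global Q-value
        return True
    except Exception:
        # Syntax errors, missing function, and runtime errors all land here.
        return False


def select_tfcaf(coder: CoderFn, evaluator: EvaluatorFn, prompt: str, k: int,
                 q_values: Sequence[float], state: Sequence[float],
                 max_retries: int = 3) -> str:
    """Generate up to k error-free candidates, then let the evaluator pick one."""
    candidates: List[str] = []
    for _ in range(k):
        for _ in range(max_retries):
            source = coder(prompt)
            if runs_without_error(source, q_values, state):
                candidates.append(source)
                break  # this slot is filled; faulty drafts were discarded
    if not candidates:
        raise RuntimeError("no candidate passed the error check")
    return candidates[evaluator(candidates)]
```

Separating the mechanical error check from the evaluator's judgment means the evaluator only ever compares candidates that already execute, which mirrors the order of steps in the caption.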

Theorems & Definitions (2)

  • Definition 1: IGM-Gating Mechanism
  • Definition 2