Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, Ying Nian Wu

Abstract

Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose \textbf{Agent Q-Mix}, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy among existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8\% accuracy, outperforming Microsoft Agent Framework (19.2\%) and LangGraph (19.2\%), with AutoGen and Lobster by OpenClaw trailing further behind. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
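To ground the architecture the abstract describes, below is a minimal PyTorch sketch of the two learned components it names: a per-agent recurrent utility network (GRU memory plus Q-head; the topology-aware GNN encoder is elided and the observation is treated as a flat feature vector) and a QMIX mixing network whose non-negative, state-conditioned weights enforce monotonic value factorization. All module names, dimensions, and the reward comment are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQNet(nn.Module):
    """Per-agent recurrent utility network: observation -> GRU memory -> Q-head.
    (The paper also uses a topology-aware GNN encoder; that stage is omitted
    here and the observation is treated as a flat feature vector.)"""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.enc = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h):
        h = self.gru(torch.relu(self.enc(obs)), h)
        return self.q_head(h), h  # per-action Q-values and updated memory

class QMixer(nn.Module):
    """QMIX mixing network: combines per-agent Q-values into Q_tot using
    state-conditioned weights; the absolute value keeps them non-negative,
    which is what enforces monotonicity (cf. Theorem F.1 below)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):  # agent_qs: (B, N), state: (B, S)
        B = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(B, 1, self.embed_dim)
        hidden = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)      # (B, 1, E)
        w2 = torch.abs(self.hyper_w2(state)).view(B, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        return (hidden @ w2 + b2).view(B, 1)                 # Q_tot: (B, 1)

# Assumed episode-level reward shape implied by the abstract (accuracy vs. token
# cost); the exact form is not given here: r = w_acc * accuracy - w_tok * tokens.
```

During centralized training, `Q_tot` is regressed against a TD target using the global state, while at test time each `AgentQNet` acts greedily from its local history alone; that is the CTDE split the abstract refers to.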

Paper Structure

This paper contains 54 sections, 3 theorems, 26 equations, 5 figures, 6 tables, and 1 algorithm.

Key Result

Theorem F.1

Let $Q_{\text{tot}}(\boldsymbol{\tau},\mathbf{u},s) = f_\psi\bigl(Q_1(\tau^1,u^1),\ldots,Q_N(\tau^N,u^N);\,s\bigr)$, where $f_\psi$ is the QMIX mixing network parameterized by $\psi$. If the mixing weights satisfy $W_k = |\mathrm{HyperNet}_k(s)| \ge 0$ for all layers $k$, then $\partial Q_{\text{tot}} / \partial Q_i(\tau^i,u^i) \ge 0$ for every agent $i$, and consequently $Q_{\text{tot}}$ satisfies the IGM property. $\blacktriangleleft$
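For intuition, here is the standard one-line monotonicity argument behind this theorem, sketched under the assumption that the mixing network uses monotone non-decreasing activations $\sigma$ (e.g. ELU, as in QMIX):

$$\frac{\partial Q_{\text{tot}}}{\partial Q_i} \;=\; \sum_{\text{paths } i \,\to\, \text{output}} \; \prod_{k} \underbrace{W_k}_{\ge 0}\,\underbrace{\sigma'(\cdot)}_{\ge 0} \;\ge\; 0, \qquad\text{hence}\qquad \arg\max_{\mathbf{u}} Q_{\text{tot}}(\boldsymbol{\tau},\mathbf{u},s) \;=\; \bigl(\arg\max_{u^1} Q_1(\tau^1,u^1),\,\ldots,\,\arg\max_{u^N} Q_N(\tau^N,u^N)\bigr),$$

which is precisely the Individual-Global-Max (IGM) condition: greedy per-agent action selection recovers the greedy joint action.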

Figures (5)

  • Figure 1: Overview of Agent Q-Mix. Agents select communication actions, induce a round-wise communication graph, exchange messages over that graph, and produce a final team output under centralized training with decentralized execution.
  • Figure 2: Illustration of the six discrete communication actions in the Agent Q-Mix action space (Section \ref{sec:action_space}). Each panel depicts the induced graph structure for a representative multi-agent team. Arrows indicate the direction of information flow. From left to right: (a) solo_process---the agent works independently with no outgoing edges; (b) broadcast_all---the agent sends its message to all other agents; (c) selective_query---the agent sends a targeted query to one specific neighbor; (d) aggregate_refine---the agent collects responses from all others through incoming edges; (e) execute_verify---the agent forwards its output to the next agent for verification with minimal communication overhead; (f) debate_check---two agents engage in mutual exchange via bidirectional edges. These six primitives span a range of connectivity levels from isolated nodes to dense cliques, enabling the QMIX policy to compose task-adaptive topologies (a toy sketch of how these actions induce a graph follows this list).
  • Figure 3: Case study of Agent Q-Mix showing a representative communication and reasoning trajectory across rounds. In practice, mathematics tasks typically converge in $T=3$ communication rounds, while coding and reasoning tasks require $T=2$ rounds (see the ablation on communication rounds in Figure \ref{fig:ablation}d and Appendix \ref{app:training}).
  • Figure 4: Humanity's Last Exam (HLE). Accuracy (%) on the first 250 MCQ items of HLE using Gemini-3.1-Flash-Lite. Agent Q-Mix (20.8%) outperforms Microsoft Agent Framework (19.2%), LangGraph (19.2%), AutoGen, and Lobster, while consuming significantly fewer tokens.
  • Figure 5: Ablation studies on Gemini-3.1-Flash-Lite. (a) Number of agents vs. accuracy on Beyond-AIME, average token cost, and HumanEval. (b) Number of training examples vs. accuracy. (c) Accuracy reward weight $w_{\mathrm{acc}}$ vs. accuracy and average token cost (with $w_{\mathrm{tok}}=0.075$). (d) Number of communication rounds vs. accuracy.
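To make the action space concrete, here is a toy Python sketch of how a joint action could induce the round-wise communication graph described in the Figure 2 caption. The six action names follow the figure, but the target-selection convention and the `induce_graph` helper are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

# The six communication primitives from Figure 2.
ACTIONS = ["solo_process", "broadcast_all", "selective_query",
           "aggregate_refine", "execute_verify", "debate_check"]

def induce_graph(joint_actions, targets, n_agents):
    """Build a directed adjacency matrix A for one round, where
    A[i, j] = 1 means agent i sends a message to agent j."""
    A = np.zeros((n_agents, n_agents), dtype=int)
    for i, act in enumerate(joint_actions):
        if act == "solo_process":          # isolated node: no outgoing edges
            continue
        elif act == "broadcast_all":       # edges to every other agent
            A[i, :] = 1
            A[i, i] = 0
        elif act == "selective_query":     # single targeted edge
            A[i, targets[i]] = 1
        elif act == "aggregate_refine":    # incoming edges from all others
            A[:, i] = 1
            A[i, i] = 0
        elif act == "execute_verify":      # forward output to the next agent
            A[i, (i + 1) % n_agents] = 1
        elif act == "debate_check":        # bidirectional exchange with a peer
            A[i, targets[i]] = 1
            A[targets[i], i] = 1
    return A

# Example: agent 0 broadcasts, agent 1 works solo, agent 2 debates with agent 0.
print(induce_graph(["broadcast_all", "solo_process", "debate_check"],
                   targets={2: 0}, n_agents=3))
```

The joint action thus fully determines the round's message-passing graph, which is what lets the per-agent Q-heads compose topologies ranging from isolated nodes to dense cliques.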

Theorems & Definitions (7)

  • Definition F.1: Individual-Global-Max (IGM) (Rashid et al., 2018)
  • Theorem F.1: Monotonicity $\Rightarrow$ IGM
  • Proof of Theorem F.1
  • Theorem F.2: Convergence of QMIX TD Iterates
  • Proof of Theorem F.2
  • Corollary F.1: Eventual Optimality Under an Action Gap
  • Proof of Corollary F.1