Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

Yanming Li, Xuelin Zhang, WenJie Lu, Ziye Tang, Maodong Wu, Haotian Luo, Tongtong Wu, Zijie Peng, Hongze Mi, Yibo Feng, Naiqiang Tan, Chao Huang, Hong Chen, Li Shen

TL;DR

The paper tackles credit assignment in tool-augmented multi-agent LLM systems by introducing SHARP, a Shapley-based hierarchical attribution framework that decomposes rewards into global accuracy, marginal-credit, and tool-process signals. It employs counterfactual masking and group-relative policy optimization (GRPO) to stabilize training and align planning and execution. Across diverse real-world benchmarks, SHARP improves average match by 23.66% over single-agent and 14.05% over multi-agent baselines, scales robustly with model size, and shows improved planner-worker coordination with fewer harmful interactions. The approach provides a principled, interpretable foundation for scalable, cross-task multi-agent optimization in complex decision-making scenarios.
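
As a rough illustration of the group-relative idea mentioned above, the sketch below normalizes rewards within one group of trajectories sampled for the same task, so that each trajectory is scored relative to its siblings. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return (r - group mean) / (group std + eps) for each reward in one group.

    Assumption: population standard deviation; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled on the same task.
group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group)

# A group where every trajectory gets the same reward carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Successful trajectories receive positive advantages and failed ones negative, while a group with identical rewards yields all zeros. In a per-agent setting, the same normalization could be applied separately to each agent's reward across the group.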

Abstract

Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and solving complex problems. However, training these systems remains notoriously difficult due to the credit assignment challenge, as it is often unclear which specific functional agent is responsible for the success or failure of decision trajectories. Existing methods typically rely on sparse or globally broadcast rewards, failing to capture individual contributions and leading to inefficient reinforcement learning. To address these limitations, we introduce the Shapley-based Hierarchical Attribution for Reinforcement Policy (SHARP), a novel framework for optimizing multi-agent reinforcement learning via precise credit attribution. SHARP effectively stabilizes training by normalizing agent-specific advantages across trajectory groups, primarily through a decomposed reward mechanism comprising a global broadcast-accuracy reward, a Shapley-based marginal-credit reward for each agent, and a tool-process reward to improve execution efficiency. Extensive experiments across various real-world benchmarks demonstrate that SHARP significantly outperforms recent state-of-the-art baselines, achieving average match improvements of 23.66% and 14.05% over single-agent and multi-agent approaches, respectively.
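
To make the marginal-credit idea concrete, here is a minimal sketch of exact Shapley values over a small set of agents. It assumes a hypothetical coalition value function `value(S)`, read as the task score obtained when only the agents in `S` are left unmasked. The agent names and the scores in `toy_scores` are invented for illustration and are not results from the paper, and the paper's own estimator may differ from this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(agents, value):
    """Exact Shapley value of each agent under the coalition value function `value`."""
    n = len(agents)
    phi = {}
    for agent in agents:
        others = [a for a in agents if a != agent]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain `agent`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                # Marginal contribution of `agent` when joining this coalition.
                total += weight * (value(coalition | {agent}) - value(coalition))
        phi[agent] = total
    return phi


# Hypothetical scores for every subset of three illustrative agents.
toy_scores = {
    frozenset(): 0.0,
    frozenset({"planner"}): 0.2,
    frozenset({"searcher"}): 0.0,
    frozenset({"coder"}): 0.0,
    frozenset({"planner", "searcher"}): 0.8,
    frozenset({"planner", "coder"}): 0.4,
    frozenset({"searcher", "coder"}): 0.0,
    frozenset({"planner", "searcher", "coder"}): 1.0,
}

credit = shapley_values(["planner", "searcher", "coder"], lambda s: toy_scores[s])
```

With these toy scores the credits come out to roughly 0.6 for the planner, 0.3 for the searcher, and 0.1 for the coder, and they sum to the full-team score of 1.0. Exact enumeration evaluates every coalition, so its cost grows exponentially with the number of agents; larger systems would need sampling or another approximation.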

Paper Structure

This paper contains 44 sections, 24 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Existing credit assignment policy for all agents (left) and the precise strategy of SHARP for each individual agent (right).
  • Figure 2: Overview of SHARP framework. The pipeline involves (a) hierarchical interaction between planner and worker agents via a shared policy; (b) tripartite reward system integrating global accuracy, marginal credit, and tool process rewards; (c) marginal credit mechanism isolating agents' contribution via Shapley values; (d) SHARP workflow using group-relative policy for stable alignment.
  • Figure 3: Left: Ablation studies on MuSiQue and GAIA-text comparing full SHARP with variants that remove planner-level or worker-level Shapley credit. Middle: The corresponding accuracy differences ($\Delta$ Accuracy) measured relative to the no-Shapley baseline on each benchmark. Right: Evaluation on DocMath-Eval across four document-level reasoning settings, including Simple-Short (SS), Simple-Long (SL), Complex-Short (CS), and Complex-Long (CL).
  • Figure 4: Parameter scalability on MuSiQue from 0.6B to 8B. SHARP shows consistent improvement as the model size increases and achieves a larger advantage over the baselines at larger scales.
  • Figure 5: Training-step scalability on GAIA-text from 0 to 180 steps. SHARP improves steadily as training progresses and avoids the instability observed in the baseline; shaded areas denote 95% confidence intervals.
  • ...and 2 more figures