Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang

TL;DR

CoRL is a reinforcement learning framework that optimizes the performance-cost trade-off in a controllable multi-budget setting. It enables a single system to surpass the best expert LLM under high-budget settings while maintaining strong performance in more economical low-budget modes.

Abstract

Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems in which specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus incur substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework in which a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to adapt its behavior to different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance-cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
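The dual objective described in the abstract can be illustrated with a minimal sketch. This summary does not state the exact reward definition, so the form below is an assumption: a binary performance reward plus a weighted cost reward that is granted only when the per-query cost stays within a budget threshold B. The names `Rollout`, `performance_reward`, `cost_reward`, `total_reward`, and `cost_weight` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One controller trajectory (hypothetical container, not from the paper)."""
    correct: bool  # whether the final answer matched the reference
    cost: float    # total per-query cost of the expert calls made


def performance_reward(rollout: Rollout) -> float:
    # Assumed binary task reward: 1 for a correct answer, 0 otherwise.
    return 1.0 if rollout.correct else 0.0


def cost_reward(rollout: Rollout, budget: float) -> float:
    # Assumed threshold form: rewarded only while the cost stays within budget.
    return 1.0 if rollout.cost <= budget else 0.0


def total_reward(rollout: Rollout, budget: float, cost_weight: float = 0.5) -> float:
    # Assumed additive combination; cost_weight is an illustrative hyperparameter.
    return performance_reward(rollout) + cost_weight * cost_reward(rollout, budget)


if __name__ == "__main__":
    rollout = Rollout(correct=True, cost=0.01)
    for budget in (0.02, 0.001):
        print(budget, total_reward(rollout, budget))
```

Under this assumed form, the same correct rollout earns a higher reward when the budget is generous than when it is tight, which is the kind of signal that could push a controller toward fewer expert calls in low-budget modes.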

Paper Structure

This paper contains 26 sections, 8 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of CoRL. We adopt a centralized multi-LLM architecture, where a controller LLM coordinates interactions with multiple expert LLMs. The system is trained via RL with dual rewards for task performance and multi-level query cost. For efficiency, only the controller LLM is optimized.
  • Figure 2: Performance–cost trade-off in the two-LLM system. x-axis: per-query cost; y-axis: task performance (higher is better). In the low-budget mode, CoRL primarily answers with the controller (Qwen2.5-7B-Instruct) and surpasses the controller-alone baseline on all four datasets. In the high-budget mode, CoRL leverages the expert (o3) and exceeds the single o3 baseline on three of the four datasets.
  • Figure 3: Ratio of expert LLM calls under different budget modes. Prompt A and Prompt B correspond to more constrained and more flexible system prompts, respectively. (1) For both prompt types, the expert call ratio follows the expected order of low < medium < high, learned through RL. (2) Overall, the ratio of expert calls increases as training progresses.
  • Figure 4: Calling ratio of different expert LLMs during training. Performance and cost ranking: o3 $>$ GPT-4.1 $>$ GPT-4.1-nano. (1) Under a high budget threshold ($B = 0.02$), the controller increasingly prioritizes o3 as training progresses. (2) Under a low budget threshold ($B = 0.001$), the system avoids over-reliance on o3 despite its stronger performance.
  • Figure 5: Training rewards. (1) The performance reward $r_p(\bm{x}, \bm{y})$ consistently increases under both budget settings as training progresses. (2) The cost reward $r_c(\bm{y})$ decreases over time, as the system hits the budget threshold more frequently when performance improves. (3) The overall reward $r_{\phi}(\bm{x}, \bm{y})$ rises steadily for $B = 0.02$ but fluctuates for $B = 0.001$, reflecting the interplay between performance and cost.
  • ...and 2 more figures
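
The expert call ratio reported in Figure 3 can be computed from logged rollouts along the following lines. This is a minimal sketch under assumptions: the record layout (a budget mode label paired with a flag for whether any expert was called) and the function name `expert_call_ratio` are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def expert_call_ratio(records):
    """Return the fraction of queries that called an expert, per budget mode.

    records: iterable of (budget_mode, called_expert) pairs, where
    budget_mode is a label such as "low" and called_expert is a bool.
    """
    calls = defaultdict(int)
    totals = defaultdict(int)
    for mode, called in records:
        totals[mode] += 1
        calls[mode] += int(called)
    return {mode: calls[mode] / totals[mode] for mode in totals}


if __name__ == "__main__":
    # Made-up log entries, for illustration only.
    log = [
        ("low", False), ("low", False), ("low", True), ("low", False),
        ("high", True), ("high", True),
    ]
    print(expert_call_ratio(log))
```

Tracking this ratio per mode over training steps would give curves comparable in shape to those the caption describes, with the ordering low < medium < high as the expected outcome.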