Dynamic Optimizations of LLM Ensembles with Two-Stage Reinforcement Learning Agents

Selim Furkan Tekin, Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Ling Liu

TL;DR

RL-Focal tackles the instability of static LLM ensembles by introducing a two-stage RL framework that dynamically routes to a small, task-specific ensemble and fuses their outputs. By modeling the problem as a DEC-POMDP and optimizing with a centralized critic (MAPPO), the Decider prunes ensembles using focal diversity, while the Fusion agent resolves conflicts among ensemble members. The key contributions are the task-adaptive ensemble pruning via focal diversity, the two-stage RL architecture, and theoretical and empirical demonstrations of improved generalization and robustness across five benchmarks. Practically, RL-Focal offers a scalable, cost-efficient approach to harness heterogeneous LLMs for diverse downstream tasks, with open-source code enabling reproducibility and reuse.

Abstract

The advancement of LLMs and their accessibility have triggered renewed interest in multi-agent reinforcement learning as a robust and adaptive framework for dynamically changing environments. This paper introduces RL-Focal, a two-stage RL agent framework that routes and ensembles LLMs. First, we develop the Decider RL-agent, which learns to dynamically select a small ensemble of size $m_i$ from a pool of $N$ LLMs ($m_i \ll N$) for incoming queries from a user-defined downstream task $i$, by maximizing both the error diversity and the reasoning performance of the selected ensemble through iterative updates of task-adaptive rewards and policy. Second, to enable effective fusion of dynamically selected LLMs, we develop the stage-2 Fusion RL-agent, which learns to resolve reasoning conflicts among different LLMs and dynamically adapts to the different ensemble teams composed by the Decider Agent for different downstream tasks. Third, we introduce the focal diversity metric to better model the error correlations among multiple LLMs, further improving the generalization performance of the Decider Agent, which actively prunes the ensemble combinations. Focal diversity enhances performance across tasks by promoting reward-aware and policy-adaptive ensemble selection and inference fusion. Extensive evaluations on five benchmarks show that RL-Focal, using a small ensemble, achieves a performance improvement of 8.48\% over the best individual LLM in the pool and offers stronger robustness. Code is available at https://github.com/sftekin/rl-focal
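
To make the selection problem concrete, the following is a minimal sketch of scoring every size-$m$ team drawn from a pool of $N$ models by a weighted mix of majority-vote accuracy and error diversity. It is not the RL-Focal method: there is no RL agent here, and the diversity term is the classical pairwise disagreement measure standing in for focal diversity, whose definition is not given in this summary. The function names, the `weight` parameter, and the toy predictions are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def ensemble_accuracy(preds, labels, team):
    """Accuracy of the majority vote of the models in `team`."""
    hits = sum(
        majority_vote([preds[m][q] for m in team]) == y
        for q, y in enumerate(labels)
    )
    return hits / len(labels)


def disagreement(preds, labels, team):
    """Pairwise disagreement: share of (model pair, query) cases in which
    exactly one of the two models is correct. This is a stand-in proxy for
    error diversity, NOT the paper's focal diversity metric."""
    pairs = list(combinations(team, 2))
    differing = sum(
        (preds[a][q] == y) != (preds[b][q] == y)
        for a, b in pairs
        for q, y in enumerate(labels)
    )
    return differing / (len(pairs) * len(labels))


def rank_teams(preds, labels, size, weight=0.5):
    """Score every team of `size` models; highest combined score first."""
    scored = [
        (
            weight * ensemble_accuracy(preds, labels, team)
            + (1 - weight) * disagreement(preds, labels, team),
            team,
        )
        for team in combinations(range(len(preds)), size)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Toy data (hypothetical): 4 models answering 6 queries.
labels = [0, 1, 2, 1, 0, 2]
preds = [
    [0, 1, 2, 1, 1, 0],  # model 0
    [0, 1, 2, 1, 1, 0],  # model 1: makes exactly the same errors as model 0
    [0, 1, 0, 2, 0, 2],  # model 2
    [1, 1, 2, 0, 0, 2],  # model 3
]
ranking = rank_teams(preds, labels, size=3)
best_score, best_team = ranking[0]
```

In this toy pool every model has the same individual accuracy, yet teams that include both model 0 and model 1 gain nothing from the second copy because their errors coincide. Teams that pair one of them with models 2 and 3 recover the queries the others miss. The exhaustive enumeration is only workable for small pools, which is one motivation for learning the selection instead.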

Paper Structure

This paper contains 21 sections, 31 equations, 6 figures, 10 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of the RL-Focal two-stage ensemble built from reinforcement learning agents.
  • Figure 2: All candidate ensemble teams from the model pool are plotted with their focal diversity scores, Fleiss' Kappa, and accuracy on four popular LLM evaluation datasets. We use cubic interpolation to create a surface; dark red represents a higher performance score.
  • Figure 3: The first two plots from the left show the performance of the Decider and Fusion agents on each dataset. The shaded regions span one standard deviation around the mean over 5 experiments. The last two plots show how diversity metrics affect the performance and cost of the RL system on the GSM8k dataset.
  • Figure 4: The first plot shows how often RL-Focal is correct when exactly $n$ base models are correct (x-axis). The plot in the middle shows the performance of RL-Focal compared to two greedy approaches. The third plot shows how often RL-Focal corrects simultaneous errors made by top-performing base models. The last plot shows the improvement by the branching design at Decider Agent.
  • Figure 5: Effect of $\alpha$ on the performance and cost of RL-Focal.
  • ...and 1 more figure
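
The Fleiss' Kappa statistic named in the Figure 2 caption measures agreement among a fixed number of raters beyond what chance would produce. As a hedged sketch, here is the standard textbook formula with each model in a team treated as a rater and each query as an item. How the paper computes it in detail is not stated in this summary, so the input layout and the function name below are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`, where ratings[i] lists the answer each
    rater (here: each model) gave to item i. Every item must have the same
    number of raters, and there must be at least two of them."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # P_i: share of rater pairs that agree on this item.
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    # P_e: agreement expected by chance from the overall category shares.
    p_e = sum(
        (total / (n_items * n_raters)) ** 2
        for total in category_totals.values()
    )
    if p_e == 1.0:
        # Only one category was ever used, so kappa is undefined (0/0).
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 means the models always give the same answer, and values near or below 0 mean they agree no more often than chance. For example, `fleiss_kappa([[0, 0, 0], [1, 1, 1]])` is 1.0, while two raters who always split, `fleiss_kappa([[0, 1], [0, 1]])`, give -1.0.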