Dynamic Optimizations of LLM Ensembles with Two-Stage Reinforcement Learning Agents
Selim Furkan Tekin, Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Ling Liu
TL;DR
RL-Focal tackles the instability of static LLM ensembles with a two-stage RL framework that dynamically routes each query to a small, task-specific ensemble and fuses the outputs of its members. The problem is modeled as a DEC-POMDP and optimized with a centralized critic (MAPPO): the Decider prunes candidate ensembles using focal diversity, while the Fusion agent resolves conflicts among ensemble members. The key contributions are task-adaptive ensemble pruning via focal diversity, the two-stage RL architecture, and theoretical and empirical demonstrations of improved generalization and robustness across five benchmarks. Practically, RL-Focal offers a scalable, cost-efficient way to harness heterogeneous LLMs for diverse downstream tasks, with open-source code enabling reproducibility and reuse.
Abstract
The advancement of LLMs and their accessibility have triggered renewed interest in multi-agent reinforcement learning as a robust and adaptive framework for dynamically changing environments. This paper introduces RL-Focal, a two-stage RL agent framework that routes and ensembles LLMs. First, we develop the Decider RL-agent, which learns to dynamically select a small ensemble of size $m_i$ from a pool of $N$ LLMs ($m_i \ll N$) for incoming queries from a user-defined downstream task $i$, by maximizing both the error diversity and the reasoning performance of the selected ensemble through iterative updates of task-adaptive rewards and policy. Second, to enable effective fusion of the dynamically selected LLMs, we develop the stage-2 Fusion RL-agent, which learns to resolve reasoning conflicts among different LLMs and dynamically adapts to the different ensemble teams composed by the Decider Agent for different downstream tasks. Third, we introduce the focal diversity metric to better model the error correlations among multiple LLMs, further improving the generalization performance of the Decider Agent, which actively prunes the ensemble combinations. With focal diversity, we enhance performance across tasks by effectively promoting reward-aware and policy-adaptive ensemble selection and inference fusion. Extensive evaluations on five benchmarks show that RL-Focal achieves a performance improvement of 8.48\% with a small ensemble compared to the best individual LLM in the pool, and offers stronger robustness. Code is available at https://github.com/sftekin/rl-focal
