Reward Model Routing in Alignment

Xinle Wu, Yao Lu

TL;DR

BayesianRouter addresses adaptive reward-model (RM) routing in RLHF/RLAIF by combining an offline RM-strength router, trained on preference data, with an online Bayesian Thompson sampling router. The offline component provides a principled prior over RM competencies via a Bradley–Terry embedding. The online component updates the RM posteriors to track the evolving policy distribution while keeping the number of RM calls per query at $O(1)$. Across instruction-following and reasoning benchmarks, BayesianRouter outperforms single RMs, RM ensembles, and prior routing methods, and ablations confirm that both the offline priors and the online adaptation contribute. The result is a scalable, uncertainty-aware routing scheme that improves alignment efficiency and robustness, with a possible extension to cost-aware routing under a compute budget.

Abstract

Reinforcement learning from human or AI feedback (RLHF / RLAIF) has become the standard paradigm for aligning large language models (LLMs). However, most pipelines rely on a single reward model (RM), which limits alignment quality and risks overfitting. Recent work explores RM routing, which dynamically selects an RM from a candidate pool to exploit complementary strengths while keeping RM calls at $O(1)$ per query, but existing methods suffer from cold-start problems and insufficient exploration. We propose BayesianRouter, a hybrid routing framework that combines offline learning of RM strengths with online Bayesian selection. In the offline stage, a multi-task router is trained on preference data to estimate per-RM reliability. In the online stage, a Bayesian Thompson sampling router performs per-query RM selection: it initializes RM-specific weight vectors with the offline embeddings as Gaussian priors, then updates their posteriors with online rewards to adapt to the evolving policy distribution. Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show that BayesianRouter consistently outperforms individual RMs, RM ensembling, and existing routing methods.
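To make the online stage concrete, the following is a minimal sketch of linear-Gaussian Thompson sampling over a pool of RMs, in which each RM's weight vector starts from a Gaussian prior centered on an offline embedding. It is not the authors' implementation. The class name `GaussianThompsonRouter`, the parameters `prior_var` and `noise_var`, the query feature vector `x`, and the scalar `reward` signal are all assumptions introduced here for illustration, since the abstract does not specify them.

```python
import numpy as np


class GaussianThompsonRouter:
    """Illustrative linear-Gaussian Thompson sampling over K reward models.

    Assumption: RM k has a weight vector w_k with prior N(prior_means[k],
    prior_var * I), and the observed routing reward is w_k . x plus Gaussian
    noise with variance noise_var. These modeling choices are hypothetical.
    """

    def __init__(self, prior_means, prior_var=1.0, noise_var=1.0, seed=0):
        prior_means = np.asarray(prior_means, dtype=float)
        self.num_rms, self.dim = prior_means.shape
        self.noise_var = noise_var
        # Per-RM posterior precision matrix, initialized to the prior precision.
        self.precision = np.stack(
            [np.eye(self.dim) / prior_var for _ in range(self.num_rms)]
        )
        # Per-RM precision-weighted mean (precision @ mean).
        self.b = prior_means / prior_var
        self.rng = np.random.default_rng(seed)

    def posterior(self, k):
        """Return the posterior mean and covariance of RM k's weight vector."""
        cov = np.linalg.inv(self.precision[k])
        cov = (cov + cov.T) / 2.0  # symmetrize against round-off
        return cov @ self.b[k], cov

    def select(self, x):
        """Sample one weight vector per RM and pick the highest sampled score."""
        x = np.asarray(x, dtype=float)
        scores = []
        for k in range(self.num_rms):
            mean, cov = self.posterior(k)
            w = self.rng.multivariate_normal(mean, cov)
            scores.append(w @ x)
        return int(np.argmax(scores))

    def update(self, k, x, reward):
        """Bayesian linear-regression update for the RM that was queried."""
        x = np.asarray(x, dtype=float)
        self.precision[k] += np.outer(x, x) / self.noise_var
        self.b[k] += reward * x / self.noise_var


# Usage: two RMs, 2-d query features, zero-mean priors (all values made up).
router = GaussianThompsonRouter(prior_means=np.zeros((2, 2)))
query = np.array([1.0, 0.0])
chosen = router.select(query)             # only this RM would be called
router.update(chosen, query, reward=1.0)  # reward signal is hypothetical
```

Only the selected RM is scored for each query, which is how a router of this kind keeps RM calls constant per query. An informative prior mean gives the router a sensible starting preference, and the sampling step keeps some exploration of the other RMs.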

Paper Structure

This paper contains 40 sections, 12 equations, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Overview of BayesianRouter.
  • Figure 2: Training efficiency.