
Orchestrating LLMs with Different Personalizations

Jin Peng Zhou, Katie Z Luo, Jingwen Gu, Jason Yuan, Kilian Q. Weinberger, Wen Sun

TL;DR

This work tackles personalization of large language models to individual user preferences without retraining. It introduces Mixture of Preference Experts (MoPE), a black-box, token-level output-merging framework where a lightweight Preference Control Model (PCM) assigns per-token weights to combine the next-token distributions of frozen expert LLMs. Rewards for each preference dimension are modeled with Bradley-Terry normalization and optimized via online reinforcement learning (REBEL) to maximize multi-dimensional utility. Empirical results on the Koala dataset show MoPE achieves state-of-the-art performance compared with prompting and weight-merging baselines, offering scalable and practical personalization for proprietary or closed models.

Abstract

This paper presents a novel approach to aligning large language models (LLMs) with individual human preferences, sometimes referred to as Reinforcement Learning from Personalized Human Feedback (RLPHF). Given stated preferences along multiple dimensions, such as helpfulness, conciseness, or humor, the goal is to create an LLM without re-training that best adheres to this specification. Starting from specialized expert LLMs, each trained for one such particular preference dimension, we propose a black-box method that merges their outputs on a per-token level. We train a lightweight Preference Control Model (PCM) that dynamically translates the preference description and current context into next-token prediction weights. By combining the expert models' outputs at the token level, our approach dynamically generates text that optimizes the given preference. Empirical tests show that our method matches or surpasses existing preference merging techniques, providing a scalable, efficient alternative to fine-tuning LLMs for individual personalization.
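The per-token merging step described in the abstract can be illustrated with a minimal sketch. It assumes the merge is a convex combination of the experts' next-token probability distributions, with non-negative weights normalized to sum to one. The function names `mix_next_token` and `sample_token` are hypothetical, and the hard-coded weights stand in for whatever the PCM would output at a given step.

```python
import random


def mix_next_token(expert_probs, weights):
    """Return the weighted mixture of the experts' next-token distributions.

    expert_probs: one probability list per expert, all over the same vocabulary.
    weights: one non-negative weight per expert (assumed here to come from a PCM).
    """
    if not expert_probs or len(expert_probs) != len(weights):
        raise ValueError("need exactly one weight per expert")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    vocab_size = len(expert_probs[0])
    if any(len(p) != vocab_size for p in expert_probs):
        raise ValueError("experts must share a vocabulary")
    # Normalize the weights, then mix in probability space.
    return [
        sum((w / total) * p[v] for w, p in zip(weights, expert_probs))
        for v in range(vocab_size)
    ]


def sample_token(mixture, rng):
    """Sample a token index from the mixture distribution."""
    return rng.choices(range(len(mixture)), weights=mixture, k=1)[0]


# Toy example: two experts over a three-token vocabulary.
helpful_expert = [0.7, 0.2, 0.1]
concise_expert = [0.1, 0.3, 0.6]
mixture = mix_next_token([helpful_expert, concise_expert], [0.25, 0.75])
next_token = sample_token(mixture, random.Random(0))
```

In a full generation loop this step would repeat for every token until an end-of-sequence token is sampled, with fresh weights computed from the preference and the text generated so far.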

Paper Structure (28 sections, 3 equations, 2 figures, 7 tables)


Figures (2)

  • Figure 1: Overview of MoPE for generating personalized responses. Given an instruction and a preference from the user, MoPE iteratively generates a response by sending the instruction and current generation to relevant black-box experts (frozen) that optimize individual dimensions from the preference. At the same time, a smaller, trainable Preference Control Model (PCM) learns to output a list of weights to merge the next-token probability distributions from the experts. A new token is sampled from the mixture distribution. The process continues until an EOS token is generated. Frozen models are denoted with a snowflake, and the trained model is denoted with a flame.
  • Figure 2: Overview of MoPE training. Given a preference and instruction, MoPE generates a response from the relevant experts and merging weights from the preference control model as shown in Figure 1. This output is evaluated against a reference response along all individual dimensions by the corresponding reward models (i.e. "helpfulness", "conciseness", and "humor") under Bradley-Terry modeling. The final averaged reward is used to update the weights of the preference control model. Frozen models are denoted with a snowflake, and the trained model is denoted with a flame.
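The reward computation in the Figure 2 caption can be sketched as follows. The exact formulation is not given on this page, so this assumes the standard Bradley-Terry form, in which the probability that the generated response beats the reference on one dimension is the sigmoid of the difference between their reward-model scores, and the final reward is the plain mean over dimensions. The function names and the example scores are hypothetical.

```python
import math


def bradley_terry_win_prob(score, ref_score):
    """Probability that the response beats the reference: sigmoid(score - ref_score)."""
    diff = score - ref_score
    # Numerically stable sigmoid that avoids overflow for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def averaged_preference_reward(scores, ref_scores):
    """Average the per-dimension Bradley-Terry win probabilities.

    scores / ref_scores: dicts mapping a dimension name to a reward-model score
    for the generated response and for the reference response, respectively.
    """
    if not scores or scores.keys() != ref_scores.keys():
        raise ValueError("both inputs must cover the same non-empty set of dimensions")
    probs = [bradley_terry_win_prob(scores[d], ref_scores[d]) for d in scores]
    return sum(probs) / len(probs)


# Illustrative scores only; real values would come from per-dimension reward models.
generated = {"helpfulness": 1.2, "conciseness": 0.4, "humor": -0.3}
reference = {"helpfulness": 0.8, "conciseness": 0.4, "humor": 0.1}
reward = averaged_preference_reward(generated, reference)
```

Mapping each dimension to a win probability puts all dimensions on the same 0-to-1 scale before averaging, so that no single reward model's raw score range dominates the update signal.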