Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models

Biao Liu, Ning Xu, Junming Yang, Xin Geng

TL;DR

The work tackles multi-objective alignment for large language models by addressing the burden and inefficiency of manually specified, fixed preference weights. It introduces PRO, a lightweight Preference Orchestrator that maps each input prompt $\mathbf{x}$ to a context-specific weight vector $\boldsymbol{w} \in \Delta^{K-1}$. The adapter is trained on normalized rewards across $K$ objectives, using a KL-based objective $\mathcal{L}_{\text{Pro}}$ with softmax targets $\boldsymbol{w}^*$. It can be integrated with existing methods (e.g., MoRlhf) and supports both offline weight conditioning and online, prompt-specific weighting at inference time, enabling adaptive trade-offs without exhaustive manual tuning. Theoretical analysis shows that adaptive weighting reduces the alignment gap relative to fixed weights under standard optimization assumptions, and experiments on Reddit Summary, Helpful Assistant, and Ultrafeedback show that PRO outperforms baselines in both multi-objective performance and general capability.

Abstract

While Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, aligning these models with varying human preferences across multiple objectives remains a significant challenge in practical deployments. Existing multi-objective alignment methods rely on manually specified preference weights, which not only burden users with difficult preference specification tasks but also lead to suboptimal training efficiency due to exploration of irrelevant preference combinations. To alleviate these issues, we propose a novel framework named PRO, i.e., PReference Orchestrator, which features a lightweight preference adapter that automatically infers prompt-specific preference weights during both training and deployment phases. Specifically, the adapter automatically learns appropriate preference weights for each prompt by training on normalized reward scores from multiple reward models for preferred responses, which inherently reflect effective preference balances across objectives. Additionally, we provide theoretical analysis proving that our prompt-aware preference mechanism achieves superior performance compared to fixed preference weights in multi-objective alignment scenarios. Extensive experiments across multiple tasks demonstrate the effectiveness of our method over existing multi-objective alignment approaches.
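
The training signal described above (softmax targets built from normalized reward scores, matched by the adapter under a KL-based loss) can be sketched roughly as follows. This is only an illustration under stated assumptions: the temperature parameter, the direction of the KL divergence, the example reward values, and all function names are hypothetical and are not taken from the paper.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def target_weights(normalized_rewards, temperature=1.0):
    """Assumed form of the target w*: a softmax over the K normalized
    reward scores of the preferred response (temperature is hypothetical)."""
    return softmax(normalized_rewards, temperature)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions on the simplex.
    The direction (target first) is an assumption of this sketch."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical example with K = 3 objectives.
rewards = [1.2, -0.3, 0.4]           # normalized rewards for one preferred response
w_star = target_weights(rewards)     # target weight vector on the simplex
w_pred = softmax([0.0, 0.0, 0.0])    # stand-in for an untrained adapter: uniform weights
loss = kl_divergence(w_star, w_pred) # value a training step would try to reduce
```

In this sketch the objective with the highest normalized reward receives the largest target weight, and the loss is zero only when the predicted weights equal the targets.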

Paper Structure

This paper contains 18 sections, 1 theorem, 27 equations, 4 figures, 6 tables.

Key Result

Theorem 5.1

Let $\pi_{\text{fixed}}$ be the optimal policy trained with fixed weights $\boldsymbol{w}_{\text{fixed}}$, and $\pi_{\text{adapt}}$ be the policy optimized using our Preference Orchestrator $f_\psi$. Under the following assumptions: then the alignment gaps satisfy: with probability at least $1 - \delta$, where $N$ is the number of training samples of the Preference Orchestrator.

Figures (4)

  • Figure 1: Overview of the Pro architecture. The adapter takes an input prompt and outputs a weight vector that determines how to combine multiple reward objectives for that specific context.
  • Figure 2: Reddit Summary
  • Figure 3: Helpful Assistant
  • Figure 5: Training reward curves comparing Pro-MoRlhf and Ppo on Ultrafeedback dataset.
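
As a rough illustration of the Figure 1 caption, the snippet below combines per-objective rewards using a prompt-specific weight vector. It assumes a simple linear weighted sum, which is a common choice in multi-objective RLHF but is an assumption here; the function name and the numeric values are hypothetical.

```python
def combine_rewards(weights, rewards):
    """Linear scalarization: sum_k w_k * r_k.
    Assumes weights lie on the probability simplex (non-negative, sum to 1)."""
    if len(weights) != len(rewards):
        raise ValueError("weights and rewards must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * r for w, r in zip(weights, rewards))

# Hypothetical values: weights produced by the adapter for one prompt,
# and scores from K = 3 reward models for one candidate response.
prompt_weights = [0.5, 0.3, 0.2]
reward_scores = [1.0, 0.0, -1.0]
scalar_reward = combine_rewards(prompt_weights, reward_scores)
```

A different prompt would yield a different weight vector, so the same reward scores could produce a different scalar reward.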

Theorems & Definitions (2)

  • Theorem 5.1: Superiority of Adaptive Weights
  • Remark 5.2