UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality

Zelei Cheng, Xin-Qiang Cai, Yuting Tang, Pushi Zhang, Boming Yang, Masashi Sugiyama, Xinyu Xing

TL;DR

This work tackles aligning LLMs to multi-objective user preferences by moving away from direct numerical reward targets toward a library of non-linear, strictly increasing utility functions. By conditioning on symbolic utility indices rather than raw scores, UC-MOA enables a single model to approximate distributional Pareto-optimal trade-offs with improved efficiency. The approach combines offline and online training, percentile-based utility selection, and two inference modes to realize robust, personalized response styles, backed by theoretical guarantees and extensive evaluation including a user study. Practically, UC-MOA achieves superior Pareto fronts and favorable human judgments while reducing computational overhead relative to existing multi-objective personalization methods.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. Methods such as RiC that directly inject raw reward values into prompts face significant numerical sensitivity issues--for instance, LLMs may fail to distinguish between 9.11 and 9.8--while alternatives like MORLHF, Rewarded Soups, and MODPO incur high computational costs by training multiple models. In this work, we introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations. Our approach leverages a diverse set of strictly increasing, non-linear utility functions to transform user-specified preferences into symbolic tokens, which are then used to condition a single LLM. This design not only mitigates numerical reasoning challenges but also substantially reduces training overhead, yielding models that achieve superior Pareto fronts and robust alignment across complex reward dimensions.
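The conditioning idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the two utility families, the library contents, the `<U{index}>` token format, and the helper names `make_power_utility`, `make_softmin_utility`, `utility_token`, and `build_prompt` do not come from the paper. The sketch only shows the general pattern of mapping a reward vector through strictly increasing, non-linear utilities and prepending a symbolic index token, instead of raw reward numbers, to the prompt.

```python
import math
from typing import Callable, List, Sequence

Utility = Callable[[Sequence[float]], float]


def make_power_utility(weights: Sequence[float], exponent: float) -> Utility:
    """Weighted sum of sign-preserving powers (hypothetical family).

    With positive weights and a positive exponent, the result is strictly
    increasing in every reward dimension and non-linear when exponent != 1.
    """
    def u(z: Sequence[float]) -> float:
        return sum(w * math.copysign(abs(r) ** exponent, r)
                   for w, r in zip(weights, z))
    return u


def make_softmin_utility(weights: Sequence[float], temperature: float) -> Utility:
    """Smooth minimum over weighted rewards (hypothetical family).

    Strictly increasing in every reward dimension for positive weights.
    """
    def u(z: Sequence[float]) -> float:
        total = sum(math.exp(-w * r / temperature) for w, r in zip(weights, z))
        return -temperature * math.log(total)
    return u


# Illustrative library; the paper's actual library is not reproduced here.
UTILITY_LIBRARY: List[Utility] = [
    make_power_utility([0.5, 0.5], 0.5),
    make_power_utility([0.8, 0.2], 2.0),
    make_softmin_utility([1.0, 1.0], 0.5),
]


def utility_token(index: int) -> str:
    """Symbolic token for a utility index (token format is an assumption)."""
    if not 0 <= index < len(UTILITY_LIBRARY):
        raise IndexError(f"unknown utility index: {index}")
    return f"<U{index}>"


def build_prompt(user_prompt: str, index: int) -> str:
    """Condition the prompt on a utility index rather than on raw scores."""
    return f"{utility_token(index)} {user_prompt}"


if __name__ == "__main__":
    rewards = [0.3, 1.2]  # e.g. (humor, helpfulness) scores for one response
    for i, u in enumerate(UTILITY_LIBRARY):
        print(utility_token(i), round(u(rewards), 3))
    print(build_prompt("Tell me a joke.", 2))
```

Because the model only ever sees a discrete token such as `<U2>`, it never has to compare decimal strings like 9.11 and 9.8, which is the numerical sensitivity issue the abstract attributes to raw-reward conditioning.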

Paper Structure

This paper contains 39 sections, 3 theorems, 22 equations, 8 figures, 4 tables.

Key Result

Theorem 1

Under Assumption ass:util_opt, for any generated response $y \sim \pi_{\theta,i}(x)$ with reward vector $\mathbf{z}(y)$, let [...]. Then the sub-policy $\pi_{\theta,i^\star}$ is distributionally Pareto-optimal.

Figures (8)

  • Figure 1: An illustration of the personalization mechanism employed by UC-MOA. Each user expresses a distinct preference over multiple objectives, such as humor and helpfulness as shown in the figure. These preferences are first mapped to a corresponding utility function from a learned library. The index of the selected utility function is then encoded as a symbolic token and incorporated into the input prompt. This conditioning guides the LLM in generating a response aligned with the target preference profile.
  • Figure 2: Pareto fronts for the "harmless" vs. "humor" and "deberta" vs. "faithful" tasks. Each marker on a curve is the average test reward for one user preference. A curve that encloses a larger area indicates better performance for the corresponding method.
  • Figure 3: An example of our prompt labeling technique on the "harmless" vs. "helpful" dataset.
  • Figure 4: Utility value distribution for the "harmless" vs. "humor" task. The required utility index is 8 (i.e., the 9th utility function).
  • Figure 5: Pareto fronts of the remaining four tasks.
  • ...and 3 more figures
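
The Figure 2 caption compares methods by the area their Pareto curves cover. A minimal sketch of that comparison for two objectives follows. It assumes both rewards are maximized and measures the area relative to a reference point chosen by the user. The function names `pareto_front` and `dominated_area`, and the sample points, are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def pareto_front(points: Iterable[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first reward ascending.

    Both rewards are assumed to be maximized.
    """
    front: List[Point] = []
    # Sweep from the largest first reward down; a point survives only if its
    # second reward strictly exceeds every second reward seen so far.
    for p in sorted(set(points), key=lambda q: (-q[0], -q[1])):
        if not front or p[1] > front[-1][1]:
            front.append(p)
    return front[::-1]


def dominated_area(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    area = 0.0
    prev_y = ref[1]
    # Walk the front from the largest first reward to the smallest, adding
    # one horizontal strip per front point.
    for x, y in reversed(pareto_front(points)):
        if x <= ref[0] or y <= prev_y:
            continue  # point adds nothing above the reference point
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


if __name__ == "__main__":
    method_a = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
    method_b = [(1.0, 2.0), (2.0, 1.0)]
    print(pareto_front(method_a))
    print(dominated_area(method_a, (0.0, 0.0)))
    print(dominated_area(method_b, (0.0, 0.0)))
```

Under this measure, a method whose front lies further from the reference point in both objectives gets a larger area, which matches the reading of the plots given in the caption.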

Theorems & Definitions (8)

  • Definition 1: Expected utility
  • Theorem 1: Utility‑Conditional Distributional Pareto‑Optimality
  • Theorem 1: Utility-Conditional Distributional Pareto-Optimality
  • Definition 2: FOSD for distributions
  • Definition 3: FOSD for policies
  • Lemma 1: Expected-utility separation under FOSD
  • proof
  • proof
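
The listed definitions and lemma concern first-order stochastic dominance (FOSD) and expected utility. The paper's exact statements are not reproduced on this page, so the sketch below uses the standard scalar textbook form as an assumption: distribution A dominates B when the CDF of A is no larger than the CDF of B everywhere and strictly smaller somewhere. The paper works with reward vectors, which this one-dimensional check does not cover. The function names and sample data are hypothetical.

```python
import math
from fractions import Fraction
from typing import Callable, Sequence


def empirical_cdf(samples: Sequence[float], t: float) -> Fraction:
    """Exact fraction of samples that are <= t."""
    return Fraction(sum(1 for s in samples if s <= t), len(samples))


def fosd(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if the empirical distribution of a first-order dominates that of b.

    Checks F_a(t) <= F_b(t) at every observed value, with strict inequality
    at least once. Empirical CDFs only change at observed values, so this
    grid is sufficient.
    """
    grid = sorted(set(a) | set(b))
    gaps = [empirical_cdf(b, t) - empirical_cdf(a, t) for t in grid]
    return all(g >= 0 for g in gaps) and any(g > 0 for g in gaps)


def expected_utility(samples: Sequence[float],
                     u: Callable[[float], float]) -> float:
    """Sample mean of u over the rewards."""
    return sum(u(s) for s in samples) / len(samples)


if __name__ == "__main__":
    rewards_a = [2.0, 3.0, 4.0]
    rewards_b = [1.0, 2.0, 3.0]

    def u(t: float) -> float:
        # A strictly increasing, non-linear utility chosen for illustration.
        return 1.0 - math.exp(-t)

    print(fosd(rewards_a, rewards_b))
    print(expected_utility(rewards_a, u) > expected_utility(rewards_b, u))
```

In the scalar case, strict dominance implies a strictly higher expected utility for every strictly increasing utility, which is the kind of separation the lemma's title refers to.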