UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality
Zelei Cheng, Xin-Qiang Cai, Yuting Tang, Pushi Zhang, Boming Yang, Masashi Sugiyama, Xinyu Xing
TL;DR
This work tackles aligning LLMs to multi-objective user preferences by moving away from direct numerical reward targets toward a library of non-linear, strictly increasing utility functions. By conditioning on symbolic utility indices rather than raw scores, UC-MOA enables a single model to approximate distributional Pareto-optimal trade-offs with improved efficiency. The approach combines offline and online training, percentile-based utility selection, and two inference modes to realize robust, personalized response styles, backed by theoretical guarantees and extensive evaluation including a user study. Practically, UC-MOA achieves superior Pareto fronts and favorable human judgments while reducing computational overhead relative to existing multi-objective personalization methods.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. Methods such as RiC that directly inject raw reward values into prompts face significant numerical sensitivity issues (for instance, LLMs may fail to distinguish between 9.11 and 9.8), while alternatives like MORLHF, Rewarded Soups, and MODPO incur high computational costs by training multiple models. In this work, we introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations. Our approach leverages a diverse set of strictly increasing, non-linear utility functions to transform user-specified preferences into symbolic tokens, which are then used to condition a single LLM. This design not only mitigates numerical reasoning challenges but also substantially reduces training overhead, yielding models that achieve superior Pareto fronts and robust alignment across complex reward dimensions.
