MOSLIM: Align with diverse preferences in prompts through reward classification

Yu Zhang, Wanli Jiang, Zhengyu Yang

TL;DR

This paper tackles multi-objective alignment of large language models (LLMs) to diverse human preferences, addressing the inefficiency of current approaches that rely on multiple reward models or preference-specific supervised fine-tuning. It introduces MOSLIM, a framework that uses a single multi-head reward model and a single policy model, with prompt-driven control and a reward-mapping mechanism to produce a scalar reward for optimization, eliminating the need for preference training during the SFT phase. The authors demonstrate that MOSLIM achieves superior performance on multiple benchmarks compared to MORLHF, Rewarded Soups, and RiC, while substantially reducing GPU requirements; they also show a scaling law where larger reward models yield better policy performance, and that the approach supports controllability across preference dimensions and intensities. The work provides a flexible, efficient path for dynamic, off-the-shelf-model alignment in real-world scenarios, with a final reward computation described by $r_{score} = \frac{1}{k} \sum_{i=1}^{k} \frac{(p_i^{target}-p_i^{avg})}{p_i^{std}} \cdot mask_i$, enabling fine-grained control over multiple preferences at inference time.
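
The reward formula above can be sketched in a few lines of Python. This is only an illustrative reading, not the authors' implementation: the function name `reward_score` and the list-based interface are hypothetical, and the summary does not say how $p_i^{avg}$ and $p_i^{std}$ are estimated, so the sketch takes them as given inputs. It assumes $mask_i$ is 1 when preference $i$ is requested in the prompt and 0 otherwise, and that $k$ counts all preference dimensions, as the formula is written.

```python
def reward_score(p_target, p_avg, p_std, mask):
    """Compute (1/k) * sum_i ((p_target_i - p_avg_i) / p_std_i) * mask_i.

    All four arguments are equal-length sequences with one entry per
    preference dimension (hypothetical interface, for illustration only).
    """
    k = len(p_target)
    if k == 0 or not (len(p_avg) == len(p_std) == len(mask) == k):
        raise ValueError("all inputs must be non-empty and of equal length")

    total = 0.0
    for pt, pa, ps, m in zip(p_target, p_avg, p_std, mask):
        if m == 0:
            # Assumption: a zero mask means this preference is not requested,
            # so it contributes nothing to the reward.
            continue
        if ps == 0:
            raise ValueError("standard deviation must be non-zero for active preferences")
        total += (pt - pa) / ps * m

    # The formula divides by k (all dimensions), not by the number of active ones.
    return total / k


# Three preference dimensions; the second is masked out.
example = reward_score(
    p_target=[0.9, 0.5, 0.2],
    p_avg=[0.5, 0.5, 0.5],
    p_std=[0.2, 0.1, 0.1],
    mask=[1, 0, 1],
)
```

In this example the first dimension scores two standard deviations above its average and the third scores three below, so the combined reward is negative even though one requested preference is well satisfied.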

Abstract

The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and a single policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them, and then optimizes the policy model with a scalar reward derived from a mapping function that converts the reward model's classification results into reward scores. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods. The MOSLIM method outperforms current multi-objective approaches in most results while requiring significantly fewer GPU computing resources compared with existing policy optimization methods.

Paper Structure

This paper contains 16 sections, 12 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: An overview of MOSLIM (ours), MORLHF, and Rewarded Soups (RSoups) during the training and inference stages.
  • Figure 2: Reward Model Architecture of MOSLIM
  • Figure 3: Construction process of reward model training datasets.
  • Figure 4: Ablation study on different DataType. The figure illustrates the classification performance across four DataType.
  • Figure 5: Controllability experiment results of preference intensity. From left to right, the subfigures represent preference goals <helpfulness n>, <honesty n>, and <harmless n>, with the y-axis indicating the preference evaluation scores of the model outputs. As the preference intensity $n$ increases, the scores exhibit a clear upward trend.
  • ...and 4 more figures