ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin

TL;DR

ParetoHqD tackles offline multiobjective alignment of large language models by recasting user preferences as directions in reward space and treating data near the Pareto front as high-quality. It employs a two-stage SFT pipeline guided by Pareto high-quality data, with data augmentation to mitigate overfitting and to span concave and convex regions of the front. Across two diverse tasks, ParetoHqD achieves superior Pareto fronts and higher hypervolume than five baselines, while substantially reducing language collapse and maintaining favorable computational efficiency. The work advances practical, personalized, and scalable multiobjective alignment for LLMs by addressing preference representation, data distribution, and training efficiency.
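The hypervolume metric mentioned above can be illustrated for the two-objective case. The following is a minimal sketch, not the paper's evaluation code: it assumes both rewards are maximized and normalized, and the reference point `ref` (the origin here) and the function name `hypervolume_2d` are hypothetical choices made for illustration.

```python
from typing import Iterable, Tuple


def hypervolume_2d(points: Iterable[Tuple[float, float]],
                   ref: Tuple[float, float] = (0.0, 0.0)) -> float:
    """Area dominated by `points` relative to `ref` (both objectives maximized).

    `ref` is an assumed reference point; the source does not state which one
    the authors use.
    """
    # Keep only points that are strictly better than the reference point.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        # A point adds area only if it improves the second objective.
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical per-preference average rewards, one point per preference.
front = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9)]
hv = hypervolume_2d(front)
```

A larger value means the set of evaluated policies covers more of the reward space, which is why it is a common summary statistic for comparing Pareto fronts.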

Abstract

Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multiobjective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD, which addresses these issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as "high-quality" data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. Experimental results demonstrate the superiority of ParetoHqD over five baselines on two multiobjective alignment tasks.
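The two ideas in the abstract, keeping data near the Pareto front and matching it to a preference direction, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' selection procedure: it keeps only exactly non-dominated samples (the paper speaks of data "near" the front), treats the preference direction as the line through the origin along $\bm{\omega}$, and measures closeness by perpendicular distance. The function names and the toy reward vectors are hypothetical.

```python
import math
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front_indices(rewards: Sequence[Sequence[float]]) -> List[int]:
    """Indices of reward vectors that no other vector dominates (maximization)."""
    return [i for i, r in enumerate(rewards)
            if not any(dominates(other, r) for other in rewards)]


def distance_to_direction(r: Sequence[float], omega: Sequence[float]) -> float:
    """Perpendicular distance from `r` to the line through the origin along `omega`.

    Assumption: this is one plausible notion of 'matching' a direction.
    """
    norm = math.sqrt(sum(w * w for w in omega))
    unit = [w / norm for w in omega]
    proj = sum(x * u for x, u in zip(r, unit))
    return math.sqrt(sum((x - proj * u) ** 2 for x, u in zip(r, unit)))


def select_for_preference(rewards: Sequence[Sequence[float]],
                          omega: Sequence[float], k: int) -> List[int]:
    """Pick the k non-dominated samples closest to the preference direction."""
    front = pareto_front_indices(rewards)
    ranked = sorted(front, key=lambda i: distance_to_direction(rewards[i], omega))
    return ranked[:k]


# Toy normalized reward vectors (objective 1, objective 2); purely illustrative.
rewards = [(0.9, 0.1), (0.1, 0.9), (0.6, 0.6), (0.5, 0.5), (0.2, 0.3)]
balanced = select_for_preference(rewards, omega=(0.5, 0.5), k=1)
```

In this toy example the balanced preference selects the sample with rewards (0.6, 0.6), while a preference of (1.0, 0.0) would select (0.9, 0.1). The selected indices would then define the supervised fine-tuning set for that preference.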

Paper Structure

This paper contains 22 sections, 14 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison with other baselines, where SFT refers to supervised fine-tuning on LLMs.
  • Figure 2: Schematic diagram of issues in offline methods for representing preferences.
  • Figure 3: Data distribution across normalized reward scores (harmless vs helpful) on the HH-RLHF dataset.
  • Figure 4: Some constructed preference directions of the HH-RLHF dataset, where $\bm{\omega}$ is set to [0.0,1.0], [0.2,0.8], [0.4,0.6], [0.6,0.4], [0.8,0.2], and [1.0,0.0], respectively.
  • Figure 5: Results of two tasks with normalized rewards, where each point represents the average rewards evaluated on the test set corresponding to a user preference. For the Helpful Assistant task (a), (d) and the Reddit Summary task (b), we set the human preferences $\bm{\omega}$ to [0.0,1.0], [0.1,0.9], [0.2,0.8], [0.3,0.7], [0.4,0.6], [0.5,0.5], [0.6,0.4], [0.7,0.3], [0.8,0.2], [0.9,0.1] and [1.0,0.0], respectively. For the Helpful Assistant task with three objectives (c), we set the human preferences $\bm{\omega}$ to [0.0,0.0,1.0], [0.0,1.0,0.0], [0.1,0.1,0.8], [0.1,0.8,0.1], [0.2,0.2,0.6], [0.2,0.6,0.2], [0.4,0.4,0.2], [0.6,0.2,0.2], [0.8,0.1,0.1], [0.33,0.33,0.33] and [1.0,0.0,0.0]. The 5 preferences for MORLHF and MODPO are highlighted in italics.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 1