Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Nuoya Xiong, Aarti Singh

TL;DR

Reinforcement Learning with Human Feedback (RLHF) often optimizes a single reward, but real deployments require balancing multiple objectives and diverse user groups. The authors propose Projection Optimization for MORLHF (MOPO), a framework that converts non-linear p-norm aggregation with $p\le1$ into a sequence of linear subproblems and extends to multi-group settings with group-specific weights and norms. They provide sublinear regret guarantees in both offline and online regimes and derive a nearly training-free variant when per-objective policies are already available. Empirically, MOPO delivers competitive performance against baselines and demonstrates stable behavior in multi-group settings, with significant computational advantages over prior non-linear aggregation methods.

Abstract

Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning models, particularly Language Models (LMs), with human preferences. There are typically multiple objectives driving the preference, so humans find it easier to express per-objective comparisons than a global preference between two choices. Multi-Objective RLHF (MORLHF) aims to use per-objective preference feedback and achieve Pareto optimality among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives such as the worst one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems. Each sub-problem involves only linear aggregation, making it computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can be easily adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.
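To make the contrast between linear and non-linear aggregation concrete, the following is a minimal sketch of a weighted power mean, which is one common way to write a p-norm aggregation with $p \le 1$. The function name `power_mean`, the example rewards and weights, and the assumption that rewards are positive and weights sum to one are illustrative choices, not taken from the paper; the paper's exact definition may differ in normalization.

```python
import math

def power_mean(rewards, weights, p):
    """Weighted power mean of positive rewards (hypothetical helper).

    Assumes every reward is > 0 and the weights are positive and sum to 1.
    p = 1 gives the weighted linear aggregation; p = -inf gives the worst reward.
    """
    if p == -math.inf:
        return min(rewards)
    if p == 0:
        # Limit p -> 0 is the weighted geometric mean.
        return math.exp(sum(w * math.log(r) for w, r in zip(weights, rewards)))
    return sum(w * r ** p for w, r in zip(weights, rewards)) ** (1.0 / p)

# Illustrative per-objective rewards and objective weights (made-up numbers).
rewards = [0.9, 0.4, 0.7]
weights = [0.5, 0.3, 0.2]

linear = power_mean(rewards, weights, 1)          # weighted average
worst = power_mean(rewards, weights, -math.inf)   # worst-case objective
very_negative = power_mean(rewards, weights, -200)  # approaches the worst case
```

As $p$ decreases from 1 toward $-\infty$, the aggregate moves from the weighted average toward the smallest per-objective reward, which is why a linear aggregation alone cannot express a preference for the worst objective.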

Paper Structure

This paper contains 42 sections, 16 theorems, 229 equations, 6 tables, 5 algorithms.

Key Result

Theorem 3.3

Define the max-min value as $c^*=\max_\pi \left[\min_i \mathbb{E}_\pi [r_i^*] - \beta \mathbb{D}_{\mathrm{KL}}(\pi \|\pi_{\mathrm{ref}})\right]$. Then, if we choose the target set $W_{-\infty, c}^\alpha$ such that $c$ is close to $c^*$, the resulting optimal policy also achieves a max-min value that is close to $c^*$.
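The definition of the max-min value can be illustrated on a toy problem. The sketch below assumes a two-action bandit with two objectives, a fixed reference policy, and a brute-force grid search over policies; the reward tables, `pi_ref`, `beta`, and the helper names are all hypothetical and chosen only to show how the KL-regularized worst-case objective is evaluated. It is not the paper's algorithm.

```python
import math

def kl(pi, pi_ref):
    """KL divergence between two discrete distributions (terms with pi = 0 contribute 0)."""
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)

def maxmin_objective(pi, reward_tables, pi_ref, beta):
    """min_i E_pi[r_i] - beta * KL(pi || pi_ref) for a discrete policy pi."""
    expected = [sum(p * r for p, r in zip(pi, table)) for table in reward_tables]
    return min(expected) - beta * kl(pi, pi_ref)

# Objective 1 rewards action 0, objective 2 rewards action 1 (made-up values).
reward_tables = [[1.0, 0.0], [0.0, 1.0]]
pi_ref = [0.8, 0.2]
beta = 0.1

# Brute-force search over two-action policies on a grid of 1001 points.
grid = [[k / 1000, 1 - k / 1000] for k in range(1001)]
c_star, best_pi = max(
    (maxmin_objective(pi, reward_tables, pi_ref, beta), pi) for pi in grid
)
```

In this toy case the reference policy favors action 0 and so scores poorly on the second objective, while the maximizer balances the two objectives and pays a small KL penalty for moving away from `pi_ref`.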

Theorems & Definitions (32)

  • Example 3.1: $p=1:$ Linear Aggregation
  • Example 3.2: $p=-\infty:$ worst-case reward
  • Theorem 3.3
  • Theorem 5.1: Consensus Problem
  • Theorem 5.2: Malfare
  • Theorem 5.4: Consensus
  • Theorem 5.5: Malfare
  • proof
  • Lemma B.1
  • proof
  • ...and 22 more