PEO: Improving Bi-Factorial Preference Alignment with Post-Training Policy Extrapolation

Yuxuan Liu

TL;DR

This work tackles the challenge of aligning large language models to multiple human preferences, notably helpfulness and harmlessness. It introduces Post-Training Extrapolation Optimization (PEO), a three-phase pipeline that first learns aspect-specific policies, then initializes a generalist via interpolation, and finally applies post-training extrapolation to achieve Pareto-optimal trade-offs without additional retraining. The method yields a superior Pareto front across diverse base models, enabling dynamic, inference-time steering of preferences while reducing training costs compared to MORL or soup-based approaches. Theoretical insights and extensive experiments demonstrate PEO’s ability to overcome optimization bottlenecks, generalize to novel instructions, and provide scalable, personalized alignment with practical inference-time control.

Abstract

The alignment of large language models with human values presents a critical challenge, particularly when balancing conflicting objectives like helpfulness and harmlessness. Existing approaches, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), face notable limitations: RLHF suffers from instability and inefficiency in multi-objective optimization, while DPO lacks mechanisms for dynamic trade-offs. To address these challenges, we propose Post-Training Extrapolation Optimization (PEO), a novel and efficient framework for bi-factorial alignment. PEO generates a family of Pareto-optimal policies in a single training pass by leveraging a three-phase pipeline: (1) aspect-specific learning, (2) generalist initialization via interpolation, and (3) post-training optimization via extrapolation. PEO enables dynamic adaptation to diverse user preferences at inference time without retraining. Our comprehensive experiments across multiple LLMs demonstrate that PEO achieves superior Pareto fronts compared to baselines, offering improved flexibility and computational efficiency. Theoretical analyses further highlight PEO's capacity to overcome optimization bottlenecks, paving the way for scalable, personalized alignment.
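The interpolation and extrapolation steps named in the abstract can be pictured as simple arithmetic in parameter space. The sketch below is only an illustration under stated assumptions: the abstract does not give the update rules, so the mixing weight `alpha`, the extrapolation weights `lam_help` and `lam_safe`, the reference checkpoint `theta_ref`, and the "aspect direction" form `aspect - reference` are all hypothetical choices, and the DPO training of each policy is omitted entirely.

```python
from typing import Dict, List

# Hypothetical container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def interpolate(theta_a: Params, theta_b: Params, alpha: float) -> Params:
    """Weight-space interpolation: alpha * theta_a + (1 - alpha) * theta_b."""
    return {
        name: [alpha * a + (1.0 - alpha) * b
               for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def extrapolate(theta_gen: Params, theta_ref: Params,
                theta_help: Params, theta_safe: Params,
                lam_help: float, lam_safe: float) -> Params:
    """Shift the generalist along each aspect direction (aspect - reference).

    This update form is an assumption made for illustration, not a rule
    taken from the paper.
    """
    out: Params = {}
    for name in theta_gen:
        out[name] = [
            g + lam_help * (h - r) + lam_safe * (s - r)
            for g, r, h, s in zip(theta_gen[name], theta_ref[name],
                                  theta_help[name], theta_safe[name])
        ]
    return out


# Toy two-weight "models" standing in for real checkpoints.
theta_ref = {"w": [0.0, 0.0]}    # reference (e.g. pre-alignment) policy
theta_help = {"w": [1.0, 0.0]}   # helpfulness-specific policy
theta_safe = {"w": [0.0, 1.0]}   # harmlessness-specific policy

# Generalist initialization: an even mix of the two aspect policies.
theta_gen = interpolate(theta_help, theta_safe, alpha=0.5)

# Inference-time steering: lean further toward helpfulness than harmlessness.
theta_steered = extrapolate(theta_gen, theta_ref, theta_help, theta_safe,
                            lam_help=0.5, lam_safe=0.25)
```

Because both steps are plain arithmetic on stored weights, changing the two extrapolation weights yields a different policy without any gradient updates, which is the sense in which preferences can be steered after training.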

Paper Structure

This paper contains 41 sections, 12 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Illustration of the proposed PEO. PEO includes three phases: 1) Aspect-Specific Learning via DPO, 2) Generalist Initialization via Interpolation, and 3) Post-Training Optimization via Extrapolation.
  • Figure 2: Optimal Pareto Front curves obtained via PEO and other preference optimization variants across different base models, on the validation instructions of BeaverTails. Reward (higher is better) and Cost (lower is better) denote the agent's alignment performance on helpfulness and harmlessness, respectively.
  • Figure 3: Helpfulness and Harmlessness win rates of PEO against baselines on AlpacaEval (Gemma-2B).
  • Figure 4: Helpfulness and Harmlessness win rates of PEO against baselines on HH test sets (Gemma-2B).
  • Figure 5: Sensitivity analysis of PEO with respect to the extrapolation weights $\lambda$ (Upper: Llama3-8B, Lower: Llama1-7B). Experiments are conducted on the test set of BeaverTails.
  • ...and 5 more figures
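
The Pareto-front comparison described in the Figure 2 caption uses two axes, Reward (higher is better) and Cost (lower is better). As a minimal sketch of how such a front can be extracted from a set of evaluated policies, the function below keeps only the non-dominated (reward, cost) pairs. The function name and the sample numbers are hypothetical and are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (reward, cost)


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by reward.

    A point is dominated if another point has reward at least as high and
    cost at least as low, and is strictly better on one of the two.
    """
    front = []
    for i, (r, c) in enumerate(points):
        dominated = any(
            r2 >= r and c2 <= c and (r2 > r or c2 < c)
            for j, (r2, c2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((r, c))
    return sorted(front)


# Made-up (reward, cost) scores for five hypothetical policies.
scores = [(1.0, 3.0), (2.0, 2.0), (1.5, 2.5), (2.0, 4.0), (3.0, 5.0)]
front = pareto_front(scores)
```

With these sample values, only (2.0, 2.0) and (3.0, 5.0) survive: every other point is matched or beaten on reward by a policy that also has lower cost.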