Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang

Abstract

Prior work in multi-objective reinforcement learning typically uses linear reward scalarization with fixed weights, which provably fails to capture non-convex Pareto fronts and thus yields suboptimal results. This limitation becomes especially critical in online preference alignment for large language models. Here, stochastic trajectories generated by parameterized policies create highly non-linear and non-convex mappings from parameters to objectives, for which no single static weighting scheme can find optimal trade-offs. We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives during training, facilitating effective exploration of Pareto fronts in objective space. We introduce two approaches of increasing sophistication and generalizability: hypervolume-guided weight adaptation and gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms, effectiveness across multiple datasets, and applicability to different model families, consistently achieving Pareto-dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.
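
The claim that fixed-weight linear scalarization misses non-convex Pareto fronts can be illustrated with a toy example. The sketch below is not taken from the paper: the three candidate points, the helper names (`dominates`, `pareto_front`, `best_under_weights`), and the weight grid are all illustrative assumptions. It sweeps fixed weights over two objectives that are both maximized and checks which Pareto-optimal points are ever selected.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


def best_under_weights(points, weights):
    """Pick the point that maximizes the fixed-weight linear scalarization."""
    return max(points, key=lambda p: sum(w * v for w, v in zip(weights, p)))


# Hypothetical objective vectors; (0.4, 0.4) sits in a non-convex dent of the front.
candidates = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]

front = pareto_front(candidates)

# Sweep fixed weights (w, 1 - w) on a grid and record every selected point.
reachable = {
    best_under_weights(candidates, (k / 100, 1 - k / 100)) for k in range(101)
}

# Pareto-optimal points that no fixed weighting in the sweep ever selects.
missed = [p for p in front if p not in reachable]
print(front)   # all three candidates are Pareto-optimal
print(missed)  # [(0.4, 0.4)]
```

All three candidates are Pareto-optimal, yet the balanced point (0.4, 0.4) is never chosen: its scalarized value is always 0.4, while one of the two extreme points always scores at least 0.5. This is the kind of trade-off a static weighting cannot reach.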

Paper Structure

This paper contains 26 sections, 2 theorems, 19 equations, 10 figures, 5 tables, 2 algorithms.

Key Result

Lemma 5.1

Recall the weight update rule defined in Eq. (weight optimization). Its true weight takes the closed form:

Figures (10)

  • Figure 1: Pareto fronts obtained by our gradient-based weight optimization compared to three baselines using fixed-weight reward interpolation. We train the Qwen3-8B model (Yang et al., 2025) on the Math500 dataset (Lightman et al., 2023) using GRPO (Shao et al., 2024). The three training configurations (accuracy-focused, balanced, and efficiency-focused) correspond to different initializations of the weights over our optimization objectives: accuracy, conciseness, and clarity. We aim to train models that achieve strong problem-solving ability (higher accuracy) with computational efficiency (fewer tokens) while maintaining interpretable reasoning processes (better clarity). Gray dots indicate Pareto-suboptimal checkpoints generated during training. Our dynamic reward weighting consistently builds superior Pareto fronts that dominate the baselines across all objectives, demonstrating its effectiveness in multi-objective alignment.
  • Figure 2: Meta-reward $r_{\text{pareto}}$ distributions, with vertical dashed lines indicating the average value.
  • Figure 3: Reward weight evolution over training.
  • Figure 4: Pareto fronts under different datasets.
  • Figure 5: Pareto fronts of Qwen3-8B trained with SafeSQL (Yao et al., 2025) problems.
  • ...and 5 more figures

Theorems & Definitions (9)

  • Definition 3.1: Pareto Front
  • Definition 3.2: Hypervolume Indicator
  • Definition 3.3: Hypervolume Contribution
  • Remark
  • Lemma 5.1
  • proof
  • Theorem 5.2
  • proof
  • Remark
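
The index above names Definitions 3.1 to 3.3 without restating them. As a rough aid, the sketch below implements the standard textbook forms of these three notions for two maximized objectives. It assumes a reference point `ref` that every point of interest strictly exceeds. The function names and sample points are hypothetical, and the paper's own definitions may differ in notation or generality.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Pareto front: the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


def hypervolume_2d(points, ref):
    """Hypervolume indicator in 2D: area dominated by the points, bounded by ref."""
    inside = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # On a 2D front, sorting by the first objective (descending) orders the
    # second objective ascending, so the area splits into horizontal slabs.
    front = sorted(pareto_front(inside), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def hypervolume_contribution(point, points, ref):
    """Hypervolume contribution: area lost when `point` is removed from the set."""
    others = [p for p in points if p != point]
    return hypervolume_2d(others + [point], ref) - hypervolume_2d(others, ref)


# Hypothetical checkpoints scored on two objectives, with reference point (0, 0).
checkpoints = [(3, 1), (2, 2), (1, 3), (1, 1)]
ref = (0, 0)

total = hypervolume_2d(checkpoints, ref)                        # 6.0
gain_mid = hypervolume_contribution((2, 2), checkpoints, ref)   # 1.0
gain_dom = hypervolume_contribution((1, 1), checkpoints, ref)   # 0.0
print(total, gain_mid, gain_dom)
```

A dominated checkpoint such as (1, 1) contributes nothing, while a point that extends the front contributes the area only it covers. This is the sense in which a hypervolume-based signal can reward new trade-offs.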