Stepwise Alignment for Constrained Language Model Policy Optimization

Akifumi Wachi, Thien Q. Tran, Rei Sato, Takumi Tanabe, Youhei Akimoto

TL;DR

Theoretical analysis provides upper bounds on optimality and safety constraint violation, and experiments show that SACPO fine-tunes Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.

Abstract

Safety and trustworthiness are indispensable requirements for real-world applications of AI systems using large language models (LLMs). This paper formulates human value alignment as an optimization problem of the language model policy to maximize reward under a safety constraint, and then proposes an algorithm, Stepwise Alignment for Constrained Policy Optimization (SACPO). One key idea behind SACPO, supported by theory, is that the optimal policy incorporating reward and safety can be directly obtained from a reward-aligned policy. Building on this key idea, SACPO aligns LLMs step-wise with each metric while leveraging simple yet powerful alignment algorithms such as direct preference optimization (DPO). SACPO offers several advantages, including simplicity, stability, computational efficiency, and flexibility of algorithms and datasets. Under mild assumptions, our theoretical analysis provides the upper bounds on optimality and safety constraint violation. Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
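
As a rough illustration of the stepwise idea, the sketch below evaluates the standard DPO loss twice: once for helpfulness against the SFT reference, and once for safety against the reward-aligned policy. It is not the authors' implementation. The function `dpo_loss`, the toy log-probabilities, and the values of `beta` and `lam` are hypothetical, and using `beta / lam` as the second-step coefficient is an assumption based on the $\beta/\lambda$ labels in the figure captions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Mean DPO loss over (chosen, rejected) pairs, given sequence log-probs."""
    losses = []
    for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l):
        margin = beta * ((pw - rw) - (pl - rl))
        # -log(sigmoid(margin)), written to avoid overflow for large |margin|
        if margin >= 0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin + math.log1p(math.exp(margin)))
    return sum(losses) / len(losses)


beta, lam = 0.1, 0.5  # hypothetical values, chosen only for illustration

# Step 1 (helpfulness): the reference is the SFT model.
sft_w, sft_l = [-4.0, -5.0], [-3.5, -4.0]  # toy SFT log-probs
step1_init = dpo_loss(sft_w, sft_l, sft_w, sft_l, beta)

# Step 2 (safety): the reference is the reward-aligned policy from step 1,
# and the coefficient is beta / lam (an assumption, see the lead-in).
ra_w, ra_l = [-6.0, -5.5], [-4.0, -4.5]  # toy reward-aligned log-probs
step2_init = dpo_loss(ra_w, ra_l, ra_w, ra_l, beta / lam)

# A policy that shifts mass toward the preferred (safe) responses
# lowers the step-2 loss relative to its starting value.
new_w = [x + 1.0 for x in ra_w]
new_l = [x - 1.0 for x in ra_l]
step2_improved = dpo_loss(new_w, new_l, ra_w, ra_l, beta / lam)

print(step1_init, step2_init, step2_improved)
```

In a real pipeline the log-probabilities would come from an LM and the loss would be minimized by gradient descent. The numbers here only show how the reference policy and the coefficient change between the two steps.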

Paper Structure (42 sections, 21 theorems, 101 equations, 7 figures, 5 tables, 1 algorithm)

Key Result

Lemma 1

Define the dual function $D(\lambda, \beta) \coloneqq \max_\pi L(\pi, \lambda, \beta)$ and the optimal dual variable $\lambda^\star \coloneqq \mathop{\mathrm{arg\,min}}\limits_{\lambda \ge 0} D(\lambda, \beta)$. Under the Slater condition assumption, there exists a primal-dual pair $(\pi^\star, \lambda^\star)$ …
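
To make the dual function concrete, here is a toy numerical sketch for a single prompt with three candidate responses. It assumes the Lagrangian has the KL-regularized form $L(\pi, \lambda, \beta) = \mathbb{E}_\pi[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) + \lambda(\mathbb{E}_\pi[g] - b)$, which is not spelled out in this excerpt. Under that assumption the inner maximization has the closed form $D(\lambda, \beta) = \beta \log \sum_y \pi_{\mathrm{ref}}(y) \exp((r(y) + \lambda g(y))/\beta) - \lambda b$. All names and numbers below (`ref`, `r`, `g`, `b`, the grid) are hypothetical.

```python
import math


def tilted_logits(ref, r, g, lam, beta):
    """Unnormalized log-weights: log pi_ref(y) + (r(y) + lam * g(y)) / beta."""
    return [math.log(p) + (ri + lam * gi) / beta for p, ri, gi in zip(ref, r, g)]


def maximizer(ref, r, g, lam, beta):
    """Policy attaining max_pi L(pi, lam, beta) under the assumed Lagrangian."""
    logits = tilted_logits(ref, r, g, lam, beta)
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def dual(ref, r, g, b, lam, beta):
    """D(lam, beta) via a numerically stable log-sum-exp."""
    logits = tilted_logits(ref, r, g, lam, beta)
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return beta * lse - lam * b


ref = [0.5, 0.3, 0.2]   # toy reference policy over three responses
r = [1.0, 0.5, 0.0]     # toy rewards (helpfulness)
g = [-1.0, 0.0, 1.0]    # toy safety scores; the constraint is E[g] >= b
b, beta = 0.0, 0.5

# Without the constraint (lam = 0) the policy violates E[g] >= b.
pi_unconstrained = maximizer(ref, r, g, 0.0, beta)
unconstrained_safety = sum(p * gi for p, gi in zip(pi_unconstrained, g))

# Minimize the dual over lam >= 0 with a coarse grid search.
grid = [i * 0.01 for i in range(501)]
lam_star = min(grid, key=lambda lam: dual(ref, r, g, b, lam, beta))
pi_star = maximizer(ref, r, g, lam_star, beta)
safety = sum(p * gi for p, gi in zip(pi_star, g))

print(lam_star, pi_star, unconstrained_safety, safety)
```

In this toy problem the minimizing $\lambda$ is strictly positive and the resulting policy meets the constraint almost with equality, up to the grid resolution.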

Figures (7)

  • Figure 1: Safe RLHF (Dai et al., 2024) fits reward and safety models to reward and safety datasets with human preferences, respectively, and then leverages PPO-Lagrangian to optimize an LM policy and a Lagrangian multiplier to balance helpfulness and harmlessness. In contrast, SACPO first aligns an LM policy with the reward metric and then realigns the resulting reward-aligned policy with the safety metric (or vice versa). In this process, we can use simple RL-free algorithms (e.g., DPO, KTO) for each step, which leads to simplicity, stability, and flexibility.
  • Figure 2: Win rate against the SFT model. H and S are abbreviations for helpfulness and safety (i.e., harmlessness), respectively. Crosses represent SFT and Safe RLHF, and blue circles represent models aligned with a single metric. (a) DPO (H) $\rightarrow$ DPO (S), DPO (H) $\rightarrow$ KTO (S), and KTO (H) $\rightarrow$ DPO (S). (b) DPO (S) $\rightarrow$ DPO (H). (c) P-SACPO based on linear model merging. In (a) and (b), the numbers indicate $\beta/\lambda$. In (c), the numbers for the red triangles represent $\beta/\lambda$, while those for the green and purple squares represent $q$.
  • Figure 3: Elo scores of DPO (H) $\rightarrow$ DPO (S) and DPO (H) $\rightarrow$ KTO (S).
  • Figure 4:
  • Figure 5:
  • ...and 2 more figures
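
The caption of Figure 2(c) mentions P-SACPO, which is based on linear model merging and parameterized by $q$. A minimal sketch of linear merging follows. It assumes that $q \in [0, 1]$ is a mixing weight between two sets of model parameters. Which two models are merged, and the exact role of $q$, are not stated in this excerpt, so `merge_linear` and the toy parameter dictionaries are hypothetical.

```python
def merge_linear(theta_a, theta_b, q):
    """Return (1 - q) * theta_a + q * theta_b, parameter by parameter."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q is assumed to lie in [0, 1]")
    if theta_a.keys() != theta_b.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name in theta_a:
        a, b = theta_a[name], theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - q) * x + q * y for x, y in zip(a, b)]
    return merged


# Toy "models": flat lists stand in for weight tensors.
theta_a = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
theta_b = {"layer.weight": [4.0, 2.0], "layer.bias": [3.0]}

merged = merge_linear(theta_a, theta_b, q=0.25)
print(merged)
```

Sweeping `q` over a few values yields a family of merged models from two trained checkpoints without any further fine-tuning, which is the usual appeal of merging.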

Theorems & Definitions (42)

  • Lemma 1: Strong duality
  • Lemma 2: Boundedness of $\lambda^\star$
  • Theorem 1: Relation between $\pi^\star_{r^\star}$ and $\pi^\star$
  • Remark 1: Importance of reverse KL in the RLHF objective (eq:rlhf_obj) and the constrained problem (eq:constrained_problem)
  • Remark 2: Commutative law
  • Definition 1: $\delta$-uncertainty quantifier
  • Lemma 3: Reward and safety $\delta$-uncertainty quantifiers
  • Theorem 2: Optimality
  • Theorem 3: Safety constraint violation
  • Lemma 4
  • ...and 32 more