One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding

TL;DR

We present a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem by pre-optimizing a smooth, convex dual function with a closed form, which greatly reduces the computational burden and improves training stability.

Abstract

The growing safety concerns surrounding large language models raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, typical Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based settings (MoCAN and PeCAN, respectively). A broad range of experiments demonstrate the effectiveness and merits of our algorithms.

Paper Structure (47 sections, 7 theorems, 68 equations, 6 figures, 8 tables, 3 algorithms)

Key Result

Lemma 1

Let Assumption (asp:slater) hold. Then there is no duality gap for the constrained RLHF problem (eqn:constrained RLHF), i.e., $L(\pi^\star, 0) = D(\boldsymbol{\lambda}^\star)$. Moreover, $(\pi^\star,\boldsymbol{\lambda}^\star)$ is a saddle point of the Lagrangian $L$.
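
The zero-duality-gap claim can be sanity-checked numerically on a toy problem. The sketch below is not taken from the paper: it assumes a single prompt with five candidate responses, a single safety constraint of the form $\mathbb{E}_{\pi}[h] \ge 0$ with $h = g - \mathbb{E}_{\pi_{\rm ref}}[g] - b$, and the log-partition dual that a KL-regularized objective typically yields. All numbers and the names `pi_ref`, `r`, `g`, `beta`, `b` are hypothetical.

```python
import numpy as np

# Toy setting (all values hypothetical): one prompt, five candidate responses.
pi_ref = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # reference policy
r = np.array([1.0, 0.2, 0.5, -0.3, 0.8])        # helpfulness reward
g = np.array([-1.0, 0.5, 0.0, 1.0, 0.3])        # safety score
beta, b = 0.5, 0.2                              # KL weight, safety margin
h = g - pi_ref @ g - b                          # assumed constraint: E_pi[h] >= 0


def dual(lam):
    """Assumed closed-form dual: beta * log E_ref[exp((r + lam * h) / beta)]."""
    return beta * np.log(pi_ref @ np.exp((r + lam * h) / beta))


def tilted_policy(lam):
    """Maximizer of the Lagrangian over policies for a fixed multiplier."""
    w = pi_ref * np.exp((r + lam * h) / beta)
    return w / w.sum()


def primal_objective(pi):
    """L(pi, 0): expected reward minus beta * KL(pi || pi_ref)."""
    return pi @ r - beta * (pi @ np.log(pi / pi_ref))


# D'(lam) = E_{pi_lam}[h] is nondecreasing in lam (D is convex), so its root
# on [0, 50] can be bracketed by bisection; the constraint is active here.
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if tilted_policy(mid) @ h < 0:
        lo = mid
    else:
        hi = mid

lam_star = 0.5 * (lo + hi)
pi_star = tilted_policy(lam_star)

print("lambda*          :", lam_star)
print("constraint slack :", pi_star @ h)
print("primal L(pi*, 0) :", primal_objective(pi_star))
print("dual   D(lambda*):", dual(lam_star))
```

In this toy instance the unconstrained optimum violates the constraint, so the multiplier is strictly positive and the constraint holds with equality at the optimum, which is why the primal and dual values coincide up to floating-point error.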

Figures (6)

  • Figure 1: An illustration of the dual properties with 128 responses drawn from the Alpaca-7b-reproduced model operating over 1000 prompts from the PKU-SafeRLHF-30K dataset. (Left) The empirical distribution of the safety scores. (Middle) The dual landscape with respect to varying margin $b$. (Right) The convergence of PGD with a constant step size of one and initialization $\lambda^{(0)}=1$.
  • Figure 2: Visualization of MoCAN. (Left) Dual optimization predicts the safety improvement of practically aligned LMs. (Middle & Right) The safety/helpfulness score distribution before and after alignment ($\lambda = 0.75$).
  • Figure 3: Trade-off in improving helpfulness and safety of aligned LMs. (Left) Improvement of helpfulness score versus safety score of MoCAN-aligned LMs under model-based evaluation. (Middle & Right) Helpfulness win rate versus safety win rate of MoCAN-aligned LMs and PeCAN-aligned LMs with $\beta=0.1$, respectively, under GPT-based evaluation.
  • Figure 4: Optimal dual variables as a function of the number of prompts (Left) and number of responses per prompt (Right).
  • Figure 5: Safety score distribution after MoCAN alignment (from left to right, top to bottom, $\lambda=0.1, 0.35, 0.50, 0.90, 1.13, 1.25, 2.0$).
  • ...and 1 more figure
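
The dual optimization referenced in Figure 1 (projected gradient descent with a constant step size of one and $\lambda^{(0)}=1$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single safety constraint, an empirical log-mean-exp dual estimated from responses sampled per prompt from the reference model, and synthetic Gaussian scores in place of real reward and safety model outputs. The function names, `beta`, and the margin `b` are hypothetical.

```python
import numpy as np


def dual_value_and_grad(lam, r, h, beta):
    """Empirical dual and its derivative for a scalar multiplier.

    r, h: arrays of shape (num_prompts, num_responses) holding reward scores
    and centered, margin-shifted safety scores of reference-model samples.
    Assumed dual: beta * mean_x log mean_y exp((r + lam * h) / beta).
    """
    z = (r + lam * h) / beta
    z_max = z.max(axis=1, keepdims=True)        # stabilize the exponentials
    w = np.exp(z - z_max)
    log_mean_exp = np.log(w.mean(axis=1)) + z_max[:, 0]
    value = beta * log_mean_exp.mean()
    weights = w / w.sum(axis=1, keepdims=True)  # per-prompt softmax weights
    grad = (weights * h).sum(axis=1).mean()     # mean tilted expectation of h
    return value, grad


def projected_gradient_descent(r, h, beta, lam0=1.0, step=1.0, iters=200):
    """Minimize the dual over lam >= 0 with a constant step size."""
    lam = lam0
    for _ in range(iters):
        _, grad = dual_value_and_grad(lam, r, h, beta)
        lam = max(0.0, lam - step * grad)       # project onto lam >= 0
    return lam


# Synthetic stand-in for 200 prompts with 128 sampled responses each.
rng = np.random.default_rng(0)
r = rng.normal(size=(200, 128))                 # hypothetical reward scores
g = rng.normal(size=(200, 128))                 # hypothetical safety scores
b = 0.5                                         # hypothetical safety margin
h = g - g.mean(axis=1, keepdims=True) - b

lam_star = projected_gradient_descent(r, h, beta=1.0)
value, grad = dual_value_and_grad(lam_star, r, h, beta=1.0)
print("lambda* =", lam_star, "dual =", value, "grad =", grad)
```

Because the assumed dual is a log-sum-exp of functions affine in $\lambda$, it is convex, so a first-order method with a suitably small constant step converges. Whether a step size of one is small enough depends on the scale of the scores and on `beta`, so it should be treated as a tunable choice rather than a universal default.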

Theorems & Definitions (10)

  • Lemma 1: Strong duality (paternain2022safe)
  • Lemma 2: Explicit dual function
  • Theorem 1: Properties of the dual function
  • Remark 1: Practical validity of conditions
  • Theorem 2
  • Definition 1: $(\delta, \varepsilon_r, \{\varepsilon_{g_j}\}_{j\,=\,1}^m)$-model-accuracy
  • Theorem 3
  • Theorem 4: Lemma C.2 of chang2024dataset
  • Theorem 5
  • proof