Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning

Kaiwen Wang; Rahul Kidambi; Ryan Sullivan; Alekh Agarwal; Christoph Dann; Andrea Michi; Marco Gelmi; Yunxuan Li; Raghav Gupta; Avinava Dubey; Alexandre Ramé; Johan Ferret; Geoffrey Cideron; Le Hou; Hongkun Yu; Amr Ahmed; Aranyak Mehta; Léonard Hussenot; Olivier Bachem; Edouard Leurent

TL;DR

This paper presents Conditional Language Policy (CLP), a general framework for finetuning language models on multiple objectives. CLP learns steerable language models that outperform and Pareto-dominate existing approaches for multi-objective finetuning.

Abstract

Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge is to develop steerable language models that trade off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditional Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP learns steerable models that effectively trade off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through extensive experiments and ablations on two summarization datasets, we show that CLP learns steerable language models that outperform and Pareto-dominate existing approaches for multi-objective finetuning.

Paper Structure (52 sections, 7 theorems, 25 equations, 19 figures, 2 algorithms)

Introduction
Problem Setup
Conditional Language Policy (CLP)
Conditioning Mechanism
Three Instantiations of CLP
Experiments
Core Benchmarking Results
Single Reward, Multi KL Regularizer
Two Rewards, Fixed KL Regularizer
Three Rewards, Fixed KL Regularizer
Ablation Studies
Effect of Training Iterations
CLP With Prompt Conditioning
Model Size
Automated Evaluation
...and 37 more sections

Key Result

Theorem 1

Suppose $\widehat{\pi}_1,\widehat{\pi}_2$ are $\varepsilon$-optimal policies for the single-objective finetuning problem (eq:single-objective-ft) with $R_1,R_2$, respectively. For any $\lambda\in[0,1]$, let $\widehat{\pi}_\lambda$ be the logit mixture of $\widehat{\pi}_1$ and $\widehat{\pi}_2$. Then, the sub-optimality of $\widehat{\pi}_\lambda$ [...] where $p_{\min}$ is the minimum probability of $\widehat{\pi}_1,\widehat{\pi}_2$ over all input-out[...]
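The logit mixture named in Theorem 1 can be illustrated with a minimal sketch. It assumes the mixture is the distribution proportional to $\widehat{\pi}_1^{1-\lambda}\,\widehat{\pi}_2^{\lambda}$, i.e. a convex combination of log-probabilities followed by renormalization; the truncated statement above does not spell out which policy receives which weight, so that convention is an assumption. The function names and the three-token toy distributions are hypothetical and exist only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def logit_mixture(logp1, logp2, lam):
    """Return probabilities proportional to pi1^(1 - lam) * pi2^lam.

    Assumption: lam = 0 recovers pi1 and lam = 1 recovers pi2.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [(1.0 - lam) * a + lam * b for a, b in zip(logp1, logp2)]
    return [math.exp(v) for v in log_softmax(mixed)]


# Toy next-token log-probabilities for two policies (made-up numbers).
logp1 = log_softmax([2.0, 0.5, -1.0])
logp2 = log_softmax([-1.0, 0.5, 2.0])

mixture_mid = logit_mixture(logp1, logp2, 0.5)
mixture_at_zero = logit_mixture(logp1, logp2, 0.0)
```

Because the two toy policies are mirror images of each other, the mixture at $\lambda = 0.5$ is uniform over the three tokens, while $\lambda = 0$ returns the first policy unchanged.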

Figures (19)

Figure 1: (Left) For a fixed prompt $x$, a multi-objective LM can output $y_1,y_2,y_3$ for different weightings $w_1,w_2,w_3$ of two rewards $r_1$ and $r_2$, such that the response $y_i$ for weighting $w_i$ is preferred under the weighted reward $w_i[1] r_1 + w_i[2] r_2$. (Right) Pareto-fronts when using the rewards NLI and Rouge (see Two Rewards, Fixed KL Regularizer). Rewarded Soups (RS) (Ramé et al., 2023) is Pareto-dominated by both full-CLP (this paper) and prompting (say, Jang et al., 2023), but full-CLP is more appealing for its steerability, evidenced by its wider Pareto-front. In sum, Pareto-dominance (pushing out the front) and steerability (stretching out the front) are both key desiderata for MOFT.
Figure 2: Pareto curves for single-reward, multi-$\alpha$. Observe that CLP variants (full-CLP and attn-CLP) are competitive with DeRA, a baseline that is nearly $2\times$ as expensive to run at inference time.
Figure 3: Pareto-curves for two-reward & $\alpha=0.01$. Observe that CLP variants (full-CLP and attn-CLP) offer improved spread (compared to prompting) while Pareto-dominating the Rewarded Soups (RS) baseline.
Figure 4: Barplot of $1.0-V_{w^\top\bm{R}}(\pi_{\text{Alg}}(\cdot;w))/\widetilde{V}_{w^\top\bm{R}}^{\text{RS}}$ for three-reward experiments, where $\widetilde{V}_{w^\top\bm{R}}^{\text{RS}}$ is the KL-regularized reward of RS for weighting $w$. Lower is better and $0$ is on par with RS.
Figure 5: Pareto-curves at $10k,60k,90k$ training steps. Observe that prompting shows slightly improved steerability with a $3\times$ larger training budget but still isn't as steerable as full-CLP, which exhibits strong steerability even at $10k$ iterations.
...and 14 more figures
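The captions above rely on two quantities: Pareto-dominance between reward vectors (Figures 1 to 3) and the relative gap to Rewarded Soups plotted in Figure 4. The following is a minimal sketch of both, assuming higher reward is better on every axis and that the RS value is positive. The function names and all numbers are hypothetical and are not taken from the paper's experiments.

```python
def dominates(a, b):
    """True if a is at least as good as b on every reward and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def relative_gap_to_rs(value_alg, value_rs):
    """Figure 4 quantity: 1 - V_alg / V_RS. Lower is better; 0 is on par with RS."""
    return 1.0 - value_alg / value_rs


# Made-up (reward_1, reward_2) pairs, one per weighting.
points = [(0.2, 0.9), (0.5, 0.7), (0.4, 0.6), (0.8, 0.3), (0.7, 0.2)]
front = pareto_front(points)

# Made-up KL-regularized values for one weighting.
gap_on_par = relative_gap_to_rs(1.0, 1.0)
gap_better = relative_gap_to_rs(1.2, 1.0)
```

In this toy data, `(0.4, 0.6)` is dominated by `(0.5, 0.7)` and `(0.7, 0.2)` by `(0.8, 0.3)`, so the front keeps the other three points. A method whose value exceeds that of RS yields a negative gap, which matches the caption's "lower is better" reading.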

Theorems & Definitions (12)

Theorem 1
Theorem 2
proof: Proof of the logit-mixing theorem (thm:logit-mixing-mo).
Lemma 1
proof
Lemma 2: Hoeffding's Lemma
Lemma 3
proof
Lemma 4
proof
...and 2 more
