Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation

Yi-Chen Li; Fuxiang Zhang; Wenjie Qiu; Lei Yuan; Chengxing Jia; Zongzhang Zhang; Yang Yu; Bo An

TL;DR

This work addresses customizing pre-trained LLMs to new human preferences without forgetting existing capabilities. It reframes customization as reward maximization over two components $r_1$ and $r_2$, but avoids direct access to $r_1$ by learning a residual Q-function $\hat{Q}$ that, together with the pre-trained policy $\pi_1^*$, yields the customized policy $\tilde{\pi}^*$. The proposed Q-Adapter learns $Q_\theta$ to approximate $\hat{Q}$, sets $\alpha_0=\lambda\alpha_1$ to bypass the unknown entropy weight, and combines the adapter and base model outputs to optimize $\lambda r_1 + r_2$ under a maximum-entropy framework. Empirical results on Llama-3.1 show that Q-Adapter preserves general capabilities while effectively learning new preferences in both domain-specific and cross-task scenarios. The code is publicly available for reproducibility.
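As a rough illustration of how the adapter and base model outputs might be combined at decoding time, the sketch below assumes the customized policy takes the residual Q-learning form $\tilde{\pi}(a|s) \propto \exp\big((Q_\theta(s,a) + \alpha_0 \log \pi_1^*(a|s)) / \tilde{\alpha}\big)$. The function name `combine_policy`, the temperature `alpha_tilde`, and the toy three-token vocabulary are assumptions made for illustration and are not taken from the released code.

```python
import math

def combine_policy(base_logprobs, q_values, alpha_0=1.0, alpha_tilde=1.0):
    """Next-token probabilities proportional to
    exp((Q(s, a) + alpha_0 * log pi_1(a|s)) / alpha_tilde).

    base_logprobs: log-probabilities of each token under the base model.
    q_values: adapter outputs, one residual Q-value per token.
    Both names and the temperature alpha_tilde are hypothetical.
    """
    if len(base_logprobs) != len(q_values):
        raise ValueError("base_logprobs and q_values must have equal length")
    logits = [(q + alpha_0 * lp) / alpha_tilde
              for lp, q in zip(base_logprobs, q_values)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vocabulary of three tokens.
base = [math.log(0.7), math.log(0.2), math.log(0.1)]

# With all-zero residual Q-values the base distribution is recovered.
unchanged = combine_policy(base, [0.0, 0.0, 0.0])

# A positive residual Q-value on token 1 shifts probability mass toward it.
shifted = combine_policy(base, [0.0, 2.0, 0.0])
```

Under these assumptions, a zero adapter output leaves the pre-trained behavior untouched, which is one way to read the forgetting-mitigation claim: the adapter only needs to encode the difference the new preference makes.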

Abstract

Large Language Models (LLMs), trained on large corpora, have demonstrated remarkable abilities. However, directly applying open-source LLMs like Llama to certain real-world scenarios may not be sufficient, since most of them are trained for general purposes. Thus, the demand for customizing publicly available LLMs is emerging, but is currently under-studied. In this work, we consider customizing pre-trained LLMs with new human preferences. Specifically, the LLM should not only meet the new preference but also preserve its original capabilities after customization. Drawing inspiration from the observation that human preference can be expressed as a reward model, we propose to cast LLM customization as optimizing the sum of two reward functions, one of which (denoted as $r_1$) was used to pre-train the LLM while the other (denoted as $r_2$) characterizes the new human preference. The obstacle here is that both reward functions are unknown, making the application of modern reinforcement learning methods infeasible. Thanks to the residual Q-learning framework, we can recover the customized LLM from the pre-trained LLM and the residual Q-function without the reward function $r_1$. Moreover, we find that for a fixed pre-trained LLM, the reward function $r_2$ can be derived from the residual Q-function, enabling us to directly learn the residual Q-function from the new human preference data based on the Bradley-Terry model. We name our method Q-Adapter as it introduces an adapter module to approximate the residual Q-function for customizing the pre-trained LLM towards the new preference. Experiments based on the Llama-3.1 model on the DSP and HH-RLHF datasets illustrate the superior effectiveness of Q-Adapter in both retaining existing knowledge and learning new preferences. Code is available at https://github.com/mansicer/Q-Adapter.
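To make the Bradley-Terry step concrete, the following minimal sketch computes the negative log-likelihood of one preference pair, $-\log \sigma(u_w - u_l)$, where $u_w$ and $u_l$ are scalar scores of the chosen and rejected responses. How those scores are derived from the residual Q-function is not spelled out in this summary, so they are treated here as hypothetical inputs; the function name `bradley_terry_nll` is likewise an assumption.

```python
import math

def bradley_terry_nll(score_chosen, score_rejected):
    """Negative log-likelihood -log(sigmoid(score_chosen - score_rejected))
    of a single preference pair under the Bradley-Terry model."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch on the sign of d so that
    # the exponential never overflows.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Equal scores give a 50/50 preference, so the loss is log(2).
tie_loss = bradley_terry_nll(1.0, 1.0)

# Ranking the chosen response higher yields a smaller loss than the reverse.
good_loss = bradley_terry_nll(2.0, 0.0)
bad_loss = bradley_terry_nll(0.0, 2.0)
```

Minimizing this quantity over a preference dataset pushes the score of each chosen response above that of its rejected counterpart, which is the sense in which preference data can supervise a learned scoring function.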

Paper Structure (44 sections, 6 theorems, 22 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Language Generation as a Token-level MDP
Reinforcement Learning from Human Feedback
Residual Q-learning
LLM Customization as Reward Maximization
Comparison with Policy Regularization Methods
Q-Adapter
Experiment
Experimental Setup
Datasets
Evaluation Metrics
Baselines
Practical Implementation
Customization to Domain-specific Datasets
...and 29 more sections

Key Result

Proposition 2.0

The RLHF objective (equ:rlhf) is equivalent to [equation missing in source], where $\mathcal{H}(\pi(\cdot|s_t)) = \mathbb{E}_{a_t\sim\pi(\cdot|s_t)}[-\log \pi(a_t|s_t)]$ is the entropy of $\pi$ at the state $s_t$, and [expression missing in source] is the soft Q-function of the LLM $\pi$, with the expectation taken over the randomness of $\pi$ and $\mathcal{P}$, i.e., $a_t \sim \pi(\cdot|s_t)$, $s_{t+1}\sim\mathcal{P}(\cdot|s_t,a_t)$. The reward $r^\mathrm{KL}_
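The entropy term in the proposition, $\mathcal{H}(\pi(\cdot|s_t)) = \mathbb{E}_{a_t\sim\pi(\cdot|s_t)}[-\log \pi(a_t|s_t)]$, can be computed directly for a categorical next-token distribution. The short sketch below uses a toy four-token distribution; the function name `policy_entropy` is an assumption for illustration.

```python
import math

def policy_entropy(probs):
    """Entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s) of a categorical
    distribution, using the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A uniform distribution over four tokens has the maximum entropy, log(4).
uniform_entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])

# A deterministic distribution has zero entropy.
peaked_entropy = policy_entropy([1.0, 0.0, 0.0, 0.0])
```

In a maximum-entropy objective, this quantity is added to the reward with a weight, so a larger weight favors policies closer to uniform and a smaller one favors more deterministic policies.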

Figures (6)

Figure 1: An example of adapting a pre-trained LLM while preserving its original knowledge. Left: Suppose that we have a pre-trained helpful LLM. After adapting it to preference data on harmlessness, we would like it to be both helpful and harmless. Right: A corresponding case. The customized LLM not only inherits helpful knowledge from the pre-trained LLM, but also becomes much more harmless by learning from preference data on harmlessness.
Figure 2: Visualization of forgetting during training. The curves illustrate the MMLU scores of Q-Adapter, Replay, and PR (DPO) at different training checkpoints.
Figure 3: Evaluation on $r_1$ & $r_2$ in HH-RLHF: Win rates of Q-Adapter, Replay, and PR (DPO) against the SFT model on the helpful data ($r_1$) and the harmless data ($r_2$) from HH-RLHF.
Figure 4: Win rates of Q-Adapter against the SFT model with different $\alpha_0$ on HH-RLHF.
Figure 5: MMLU scores of Q-Adapter with different choices of the hyper-parameter $\beta$.
...and 1 more figure

Theorems & Definitions (9)

Proposition 2.0
Proposition 2.0: residualq
Corollary 4.1
proof
Proposition A.0
proof
Lemma A.1: Theorem 1 of sac
Proposition A.1: residualq
proof
