COPR: Continual Learning Human Preference through Optimal Policy Regularization

Han Zhang; Lin Gui; Yuanzhao Zhai; Hui Wang; Yu Lei; Ruifeng Xu

TL;DR

COPR uses a single learning phase and does not require complex reinforcement learning. Like RLHF, it can learn from unlabeled data by maintaining a scoring module similar to a reward model, which makes it flexible for continual learning without human feedback.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method for improving pre-trained Language Models (LMs) so that they better conform to human preferences. However, current RLHF-based LMs require full retraining each time novel queries or feedback are introduced, which is challenging because human preferences can vary between domains and tasks. Retraining LMs is impractical in many real-world situations because of the significant time and computational resources required, along with data privacy concerns. To address this limitation, we propose Continual Optimal Policy Regularization (COPR), which computes the distribution of the optimal policy while bypassing the partition function and then regularizes the current policy toward the historically optimal distribution to mitigate Catastrophic Forgetting (CF). COPR involves a single learning phase and does not require complex reinforcement learning. Importantly, like RLHF, it can learn from unlabeled data by maintaining a scoring module similar to a reward model, which makes it flexible for continual learning without human feedback. Our experimental results show that COPR outperforms strong Continual Learning (CL) baselines at consistently aligning with human preferences on incremental tasks and domains.
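The idea of computing an optimal policy distribution without the partition function can be illustrated with a small sketch. It assumes the standard KL-regularized form of the optimal policy, proportional to the reference policy times exp(reward / beta), restricted to a finite set of candidate responses; the abstract does not give COPR's exact formulation, so the function names, the `beta` parameter, and the KL penalty below are illustrative assumptions rather than the authors' implementation. Because the result is normalized over the candidate set, any per-prompt constant, including the partition function, cancels.

```python
import math

def optimal_policy_over_candidates(ref_logprobs, rewards, beta=1.0):
    """Normalize pi_ref(y|x) * exp(r(x, y) / beta) over a finite candidate set.

    Hypothetical helper: ref_logprobs are log pi_ref(y|x) for each candidate,
    rewards are scalar scores r(x, y), beta is an assumed temperature.
    """
    # Unnormalized log-weights: log pi_ref(y|x) + r(x, y) / beta.
    logits = [lp + r / beta for lp, r in zip(ref_logprobs, rewards)]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [math.exp(v - log_norm) for v in logits]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Two candidates with equal reference probability and rewards 1.0 and 0.0.
ref = [math.log(0.5), math.log(0.5)]
target = optimal_policy_over_candidates(ref, [1.0, 0.0], beta=1.0)

# Shifting every log-probability by a constant (as an unknown per-prompt
# normalizer would) leaves the normalized distribution unchanged.
shifted = optimal_policy_over_candidates([lp - 3.0 for lp in ref], [1.0, 0.0])

# An assumed regularizer: penalize drift of a current policy's candidate
# distribution away from the stored target from an earlier task.
penalty = kl_divergence(target, [0.5, 0.5])
```

In this sketch, a stored `target` from an earlier task plays the role of the historically optimal distribution, and `penalty` grows as the current policy's distribution over the same candidates moves away from it.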

Paper Structure (22 sections, 20 equations, 5 figures, 4 tables)

Introduction
Preliminaries
Static Alignment
Continual Alignment
Method
Motivation of the Method
Optimal Policy Regularization
Continual Learning on Unlabeled Data
Comparison with other methods
Classic Continuous Learning Assessment Experiments
Datasets
Baselines
Tasks and Evaluation Metrics
Results of Continual Learning from Human Preferences
Learning unlabeled responses
...and 7 more sections

Figures (5)

Figure 1: (a) The framework of COPR. The optimal policy $\pi_t^{*}$ $(t=1,2,3)$ is derived from the policy $\pi_{t-1}$. The optimal policy $\pi_t^{*}$ is used as the fitting objective of $\pi_{t}$ and as the regularization term of $\pi_{t+1}$. (b) A state-of-the-art, elaborated taxonomy (SurveyCL2023) of representative continual learning methods. Bold indicates the category to which our method belongs.
Figure 2: The score distribution of pairwise reward learning.
Figure 3: Fitting of $P^{*}_{y \in \mathcal{Y}^x,t}(y|x)$
Figure 4: Evaluation curves of TIL setting.
Figure 5: Enter Caption
