RLVF: Learning from Verbal Feedback without Overgeneralization

Moritz Stephan; Alexander Khazatsky; Eric Mitchell; Annie S Chen; Sheryl Hsu; Archit Sharma; Chelsea Finn

TL;DR

This work tackles the challenge of learning from high-level verbal feedback without overgeneralizing that feedback to contexts where it does not apply. It introduces Contextualized Critiques with Constrained Preference Optimization (C3PO), which synthetically generates in-scope, near-scope, and out-of-scope prompts to steer context-aware fine-tuning without excessive human annotation. The approach combines direct preference optimization (DPO) on in-scope data with regularization on feedback-irrelevant prompts, underpinned by a theoretical connection to the Bradley–Terry model via two-policy data. Empirically, C3PO achieves strong feedback adherence in relevant contexts while substantially reducing unintended changes elsewhere (reducing overgeneralization by about 30%) and remains composable across multiple pieces of feedback through LoRA-based parameter mixing.
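The data generation stage summarized above (and described in the Figure 3 caption) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `PreferencePair`, `ReferenceExample`, `build_datasets`, `respond`, and `revise` are hypothetical, and the prompt lists are assumed to have been produced already (by GPT-4 for in-scope and near-scope prompts, and from a prior dataset for out-of-scope prompts).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    """In-scope example: the revised response is preferred over the baseline."""
    prompt: str
    rejected: str  # y^-: baseline response from the current model
    chosen: str    # y^+: baseline response revised to follow the feedback


@dataclass
class ReferenceExample:
    """Near-scope or out-of-scope example: keep the original behavior."""
    prompt: str
    response: str  # baseline response from the current model


def build_datasets(
    feedback: str,
    in_scope_prompts: List[str],
    near_scope_prompts: List[str],
    out_of_scope_prompts: List[str],
    respond: Callable[[str], str],           # hypothetical: current model's answer
    revise: Callable[[str, str, str], str],  # hypothetical: (prompt, answer, feedback) -> revision
) -> Tuple[List[PreferencePair], List[ReferenceExample], List[ReferenceExample]]:
    d_in = []
    for x in in_scope_prompts:
        y_minus = respond(x)                    # baseline response
        y_plus = revise(x, y_minus, feedback)   # revision that incorporates the feedback
        d_in.append(PreferencePair(x, y_minus, y_plus))
    # For prompts where the feedback does not apply, store the unmodified response.
    d_near = [ReferenceExample(x, respond(x)) for x in near_scope_prompts]
    d_out = [ReferenceExample(x, respond(x)) for x in out_of_scope_prompts]
    return d_in, d_near, d_out
```

In practice `respond` and `revise` would wrap calls to the model being fine-tuned; they are left as plain callables here so the sketch stays independent of any particular inference library.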

Abstract

The diversity of contexts in which large language models (LLMs) are deployed requires the ability to modify or customize default model behaviors to incorporate nuanced requirements and preferences. A convenient interface to specify such model adjustments is high-level verbal feedback, such as "Don't use emojis when drafting emails to my boss." However, while writing high-level feedback is far simpler than collecting annotations for reinforcement learning from human feedback (RLHF), we find that simply prompting a model with such feedback leads to overgeneralization of the feedback to contexts where it is not relevant. We study the problem of incorporating verbal feedback without such overgeneralization, inspiring a new method Contextualized Critiques with Constrained Preference Optimization (C3PO). C3PO uses a piece of high-level feedback to generate a small synthetic preference dataset specifying how the feedback should (and should not) be applied. It then fine-tunes the model in accordance with the synthetic preference data while minimizing the divergence from the original model for prompts where the feedback does not apply. Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts. For both human- and GPT-4-generated high-level feedback, C3PO effectively adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
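To make the fine-tuning step concrete, the sketch below combines a standard DPO loss on in-scope preference pairs with SFT-style negative log-likelihood terms that pull the model toward its original responses on near-scope and out-of-scope prompts, as the Figure 4 caption describes. It works on precomputed sequence log-probabilities rather than a real model. The function names and the weights `w_near`, `w_out`, and `beta` are assumptions for illustration; this page does not give the paper's exact weighting.

```python
import math
from typing import List, Tuple


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_pos: float, pol_neg: float,
             ref_pos: float, ref_neg: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given log-probs of the chosen (pos)
    and rejected (neg) responses under the policy and the reference model."""
    margin = beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))
    return -log_sigmoid(margin)


def sft_loss(pol_logp: float) -> float:
    """Negative log-likelihood of the original model's response under the policy."""
    return -pol_logp


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def combined_loss(
    in_scope: List[Tuple[float, float, float, float]],  # (pol_pos, pol_neg, ref_pos, ref_neg)
    near_scope: List[float],     # policy log-probs of original near-scope responses
    out_of_scope: List[float],   # policy log-probs of original out-of-scope responses
    w_near: float = 1.0,         # assumed weight, not taken from the paper
    w_out: float = 1.0,          # assumed weight, not taken from the paper
    beta: float = 0.1,
) -> float:
    l_dpo = mean([dpo_loss(*pair, beta=beta) for pair in in_scope])
    l_near = mean([sft_loss(lp) for lp in near_scope])
    l_out = mean([sft_loss(lp) for lp in out_of_scope])
    return l_dpo + w_near * l_near + w_out * l_out
```

When the policy still equals the reference model, the DPO margin is zero and the in-scope term equals log 2; the SFT terms shrink only as the policy assigns higher likelihood to the original responses on feedback-irrelevant prompts.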

Paper Structure (30 sections, 14 equations, 9 figures, 3 tables)

Introduction
Related Work
Preliminaries
Reinforcement Learning from Verbal Feedback using Contextualized Critiques with Constrained Preference Optimization
Experiments
Quantifying and mitigating overgeneralization
Adhering to multiple feedbacks
Choice of C3PO constraint formulation
Discussion & Future Work
Sampling Details
Training Details
Derivation of Optimal Policy for PbRL on Two-Policy Preference Pairs
Deriving the underlying Bradley-Terry scoring function for synthetic two-policy preference data
Deriving the optimal policy for C3PO for in-scope prompts
Empirically validating the in-scope C3PO policy in a synthetic setting
...and 15 more sections

Figures (9)

Figure 1: We consider the problem of leveraging high-level, verbal feedback (left) to refine model behaviors (center). Prior approaches often struggle to appropriately update the model, leading to either failure to adhere to the feedback or overgeneralization (right).
Figure 2: C3PO mitigates the overgeneralization problem when learning from high-level feedback. For existing approaches for incorporating high-level feedback, high feedback adherence on in-scope prompts (x axis) strongly predicts a large change in behavior for out-of-scope prompts (y axis), which is undesirable. In contrast, our approach C3PO decreases the rate at which out-of-scope behavior is affected as in-scope feedback adherence improves. Lines of best fit are computed with linear orthogonal regression.
Figure 3: C3PO Data Generation Scheme. Given human feedback, C3PO begins by generating a set of categories of prompts where the feedback may be relevant using GPT-4. GPT-4 then generates in-scope prompts $x_i^\text{in-scope}$ and near-scope prompts $x_i^\text{near-scope}$. A set of out-of-scope prompts $x_i^\text{out-of-scope}$ is also taken from a prior dataset. The current model then generates a baseline response to each of these, giving $y_i^-$, $y_i^{\text{near-scope}}$, $y_i^{\text{out-of-scope}}$, respectively. We also prompt the current model to revise $y_i^-$ to incorporate the feedback, giving a revised response $y_i^+$. This data generation scheme is the first stage of C3PO: autonomously generating fine-tuning datasets $\mathcal{D}_\text{in-scope}$, $\mathcal{D}_\text{near-scope}$, and $\mathcal{D}_\text{out-of-scope}$, the latter two of which are used to prevent overgeneralization on irrelevant tasks.
Figure 4: C3PO Fine-Tuning Objective. C3PO facilitates feedback adherence for relevant prompts by fine-tuning with DPO on the generated in-scope data while minimizing overgeneralization through SFT losses on the generated out-of-scope and near-scope data, which regularizes model behavior towards the original model for feedback-irrelevant prompts.
Figure 5: Sample responses from C3PO and each baseline for an in-scope and out-of-scope prompt. Only C3PO correctly adheres to the feedback for the in-scope input and ignores the feedback for the out-of-scope input.
...and 4 more figures
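
The Figure 2 caption notes that its lines of best fit use linear orthogonal regression, which minimizes perpendicular distances to the line rather than vertical ones. A minimal sketch of that fit, using the closed-form slope for equal error variance in both axes, might look like the following. The function name `orthogonal_regression` is hypothetical and this is not the authors' plotting code.

```python
import math
from typing import Sequence, Tuple


def orthogonal_regression(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by minimizing perpendicular distances."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mx = sum(xs) / n
    my = sum(ys) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxy == 0:
        # The best-fit line is axis-aligned or undefined; not handled in this sketch.
        raise ValueError("zero covariance: slope is not defined by this formula")
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx  # the fitted line passes through the centroid
    return slope, intercept
```

Unlike ordinary least squares, this fit treats both axes symmetrically, which suits a plot where in-scope adherence and out-of-scope behavior change are both measured quantities.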