Learning from negative feedback, or positive feedback or both

Abbas Abdolmaleki; Bilal Piot; Bobak Shahriari; Jost Tobias Springenberg; Tim Hertweck; Rishabh Joshi; Junhyuk Oh; Michael Bloesch; Thomas Lampe; Nicolas Heess; Jonas Buchli; Martin Riedmiller

TL;DR

This work addresses learning from preference data when only unpaired positive or negative feedback is available, by decoupling learning from the two feedback types within an EM-inspired RL-as-inference framework. The proposed PMPO algorithm extends traditional EM-based policy optimization to explicitly incorporate negative outcomes through dis-preferences, using a KL regularization term to stay close to a reference policy. The approach supports learning from solely positive, solely negative, or both types of feedback and is demonstrated across bandit benchmarks, continuous-control tasks, offline RL, and language alignment, often matching or surpassing strong baselines. This flexible, data-efficient framework reduces annotation requirements and enables robust policy improvement in settings with partial or unpaired feedback, with practical implications for LLM alignment and safety-sensitive control tasks.

Abstract

Existing preference optimization methods often assume that paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability when only unpaired feedback (for example, only positive or only negative) is available. To address this, we introduce a novel approach that decouples learning from positive and negative feedback. This decoupling enables control over the influence of each feedback type and, importantly, allows learning even when only one feedback type is present. A key contribution is demonstrating stable learning from negative feedback alone, a capability not well-addressed by current methods. Our approach builds upon the probabilistic framework introduced by Dayan and Hinton (1997), which uses expectation-maximization (EM) to directly optimize the probability of positive outcomes (as opposed to classic expected reward maximization). We address a key limitation in current EM-based methods: they solely maximize the likelihood of positive examples, while neglecting negative ones. We show how to extend EM algorithms to explicitly incorporate negative examples, leading to a theoretically grounded algorithm that offers an intuitive and versatile way to learn from both positive and negative feedback. We evaluate our approach for training language models based on human feedback as well as training policies for sequential decision-making problems, where learned value functions are available.
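
The decoupling described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a categorical policy over a handful of discrete actions and an objective of the form: alpha times the mean log-likelihood of accepted samples, minus (1 - alpha) times the mean log-likelihood of rejected samples, minus beta times KL(reference || policy). The symbols `alpha` and `beta` follow the figure captions; the function names `decoupled_loss` and `fit`, the exact objective form, and the finite-difference optimizer are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-probabilities of a categorical policy."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def decoupled_loss(logits, ref_logits, accepted, rejected, alpha, beta):
    """Hypothetical decoupled accept/reject loss (assumed form, see lead-in).

    accepted / rejected are lists of action indices; either may be empty,
    in which case that term contributes nothing.
    """
    logp = log_softmax(logits)
    ref_logp = log_softmax(ref_logits)
    # Mean log-likelihood of accepted and rejected samples, kept separate.
    accept_ll = np.mean(logp[accepted]) if len(accepted) > 0 else 0.0
    reject_ll = np.mean(logp[rejected]) if len(rejected) > 0 else 0.0
    # KL(reference || policy) keeps the policy close to the reference.
    kl = np.sum(np.exp(ref_logp) * (ref_logp - logp))
    objective = alpha * accept_ll - (1.0 - alpha) * reject_ll - beta * kl
    return -objective


def fit(ref_logits, accepted, rejected, alpha, beta,
        steps=500, lr=0.1, eps=1e-5):
    """Gradient descent on the loss using central finite differences.

    Finite differences are used only to keep the sketch dependency-free;
    a real setup would use automatic differentiation.
    """
    ref_logits = np.asarray(ref_logits, dtype=float)
    logits = ref_logits.copy()
    for _ in range(steps):
        grad = np.zeros_like(logits)
        for i in range(logits.size):
            bump = np.zeros_like(logits)
            bump[i] = eps
            up = decoupled_loss(logits + bump, ref_logits,
                                accepted, rejected, alpha, beta)
            down = decoupled_loss(logits - bump, ref_logits,
                                  accepted, rejected, alpha, beta)
            grad[i] = (up - down) / (2.0 * eps)
        logits -= lr * grad
    return np.exp(log_softmax(logits))


if __name__ == "__main__":
    uniform_ref = np.zeros(4)
    # Both feedback types: action 2 accepted, action 0 rejected.
    print(fit(uniform_ref, accepted=[2], rejected=[0], alpha=0.5, beta=4.0))
    # Negative feedback only (alpha = 0): action 0 rejected.
    print(fit(uniform_ref, accepted=[], rejected=[0], alpha=0.0, beta=8.0))
```

Under these assumptions, with a uniform reference over four actions the first call settles near probabilities (0.125, 0.25, 0.375, 0.25): mass moves from the rejected action to the accepted one while the KL term holds the remaining actions at their reference values. In this toy setting the reject term is unbounded when beta is too small relative to (1 - alpha), which is one way to read the caption's remark that learning from dispreferences alone needs a sufficiently high beta.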

Paper Structure (35 sections, 62 equations, 8 figures, 2 tables)

Introduction
Related Work
RL as Inference
Preference optimization
Using positive and negative feedback for policy optimization
Background on maximising for preferred outcomes
Objective.
E-step:
M-Step:
Using dis-preferred outcomes via regularised minimum likelihood
Learning from preferred and dis-preferred outcomes
Extracting preferences from evaluation functions
Experiments
Bandit RL: Standard Functions
Full online RL: control suite
...and 20 more sections

Figures (8)

Figure 1: Performance of PMPO and DPO on Benchmark Functions - This figure illustrates the optimization progress of PMPO variants (PMPO-AR, PMPO-A, PMPO-R) on a selection of standard benchmark functions, showcasing their ability to leverage different types of preference feedback.
Figure 2: Comparison of PMPO/DPO/MPO for high-dimensional control tasks from the DeepMind Control Suite. We plot average reward over time of training (using 100 episodes for each evaluation).
Figure 3: Impact of the KL weight 'beta' on the performance of PMPO. When learning solely from dispreferences across various Control Suite tasks (Reject, $\alpha=0$), a sufficiently high beta value is required for effective learning. However, when learning from preferences only (Accept), PMPO is robust to the KL weight 'beta' across different Control Suite tasks, confirming theoretical insights. When both accept and reject signals are used (Accept & Reject), PMPO shows partial sensitivity to the KL weight 'beta'. While learning is possible with a wider range of beta values, a beta higher than 0.5 is generally needed for optimal performance.
Figure 4: Left: Impact of Combining Accept and Reject Signals - The plot demonstrates the learning progress of PMPO-AR (using both accept and reject signals) compared to PMPO-A and PMPO-R, showing faster learning when leveraging both types of feedback in the language alignment task and performance competitive with DPO. Right: Win-rate in A/B comparisons on held-out prompts for PMPO against the base Gemma checkpoint, as judged by GPT-4.
Figure 5: Rewards obtained by the policy at each training step, averaged over the batch and smoothed. Each curve corresponds to a configuration of $\beta$ specified in the legend. This figure illustrates the ability of PMPO to learn effectively from various preference signals (accept-only, reject-only, or both) in language alignment tasks, highlighting its adaptability to different preference acquisition settings.
...and 3 more figures
