Learning from negative feedback, or positive feedback or both
Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, Jonas Buchli, Martin Riedmiller
TL;DR
This work addresses learning from preference data when only unpaired positive or negative feedback is available, by decoupling learning from the two feedback types within an EM-inspired RL-as-inference framework. The proposed PMPO algorithm extends traditional EM-based policy optimization to explicitly incorporate negative outcomes through dis-preferences, using a KL regularization term to stay close to a reference policy. The approach supports learning from solely positive, solely negative, or both types of feedback and is demonstrated across bandit benchmarks, continuous-control tasks, offline RL, and language alignment, often matching or surpassing strong baselines. This flexible, data-efficient framework reduces annotation requirements and enables robust policy improvement in settings with partial or unpaired feedback, with practical implications for LLM alignment and safety-sensitive control tasks.
Abstract
Existing preference optimization methods often assume that paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability in scenarios where only unpaired feedback (for example, either positive or negative) is available. To address this, we introduce a novel approach that decouples learning from positive and negative feedback. This decoupling enables control over the influence of each feedback type and, importantly, allows learning even when only one feedback type is present. A key contribution is demonstrating stable learning from negative feedback alone, a capability not well addressed by current methods. Our approach builds upon the probabilistic framework introduced by Dayan and Hinton (1997), which uses expectation-maximization (EM) to directly optimize the probability of positive outcomes (as opposed to classic expected reward maximization). We address a key limitation in current EM-based methods: they solely maximize the likelihood of positive examples while neglecting negative ones. We show how to extend EM algorithms to explicitly incorporate negative examples, leading to a theoretically grounded algorithm that offers an intuitive and versatile way to learn from both positive and negative feedback. We evaluate our approach for training language models based on human feedback as well as training policies for sequential decision-making problems, where learned value functions are available.
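
To make the idea of decoupled feedback concrete, the following is a minimal sketch for a bandit with a categorical policy. It is not taken from the paper and should not be read as the authors' exact objective. It only illustrates the three ingredients named above: a likelihood term for positive examples, a separately weighted term that lowers the likelihood of negative examples, and a KL penalty toward a reference policy. The function name `decoupled_objective`, the weights `alpha` and `beta`, and the KL direction (reference to policy) are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decoupled_objective(logits, ref_probs, positives, negatives,
                        alpha=0.5, beta=1.0):
    """Illustrative objective to be maximized (hypothetical, not the paper's).

    logits:    unnormalized policy scores, one per action
    ref_probs: reference policy probabilities, one per action
    positives: indices of actions labeled as preferred (may be empty)
    negatives: indices of actions labeled as dis-preferred (may be empty)
    alpha:     assumed weight trading off positive vs. negative feedback
    beta:      assumed strength of the KL penalty toward the reference
    """
    log_probs = [math.log(p) for p in softmax(logits)]

    # Mean log-likelihood of positive examples (0 if none are given).
    pos = sum(log_probs[a] for a in positives) / len(positives) if positives else 0.0
    # Mean log-likelihood of negative examples (0 if none are given).
    neg = sum(log_probs[a] for a in negatives) / len(negatives) if negatives else 0.0

    # KL(ref || policy); this direction is an assumption of the sketch.
    kl = sum(q * (math.log(q) - lp)
             for q, lp in zip(ref_probs, log_probs) if q > 0)

    # Raise positive likelihood, lower negative likelihood, stay near the reference.
    return alpha * pos - (1.0 - alpha) * neg - beta * kl


uniform = [1 / 3, 1 / 3, 1 / 3]
# Both feedback types: action 0 is preferred, action 2 is dis-preferred.
both = decoupled_objective([1.0, 0.0, -1.0], uniform, [0], [2])
# Negative feedback only: alpha = 0 and no positive examples.
neg_only = decoupled_objective([0.0, 0.0, -1.0], uniform, [], [2], alpha=0.0)
```

In this sketch, setting `alpha=0.0` with an empty `positives` list corresponds to learning from negative feedback alone. Without the KL term, the negative-only objective could be increased without bound by driving the probability of a dis-preferred action toward zero, which is one intuition for why a regularizer toward the reference policy is useful for stability.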
