BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards

Sangyun Lee, Brandon Amos, Giulia Fanti

TL;DR

BaNEL addresses post-training for generative models in settings with extremely sparse rewards and costly reward evaluations. It learns from failures alone: a failure model $p_{\boldsymbol{\phi}}$ is trained on negative samples, and a rejection region $\tilde{R}$ is defined as the set of samples satisfying $\frac{p_{\boldsymbol{\theta}}(\boldsymbol{x})}{p_{\boldsymbol{\phi}}(\boldsymbol{x})}<\tau$. Conditioning the base model on the complement of this region gives a Bayesian posterior $p_{\boldsymbol{\theta}|\tilde{R}^C}$ that avoids repeating failures without discarding prior knowledge. The approach supports multiple updates per reward evaluation and sequentially narrows the search space through adaptive rejection regions and distillation, improving success rates while keeping the number of reward evaluations (NRE) bounded. Empirical results on MNIST 0-to-6, adversarial language-model attacks, and GSM8K-Hard show orders-of-magnitude gains over count-based and RND baselines under the same reward-evaluation budget, indicating that BaNEL scales with compute in extreme-sparsity settings.
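As a rough illustration of the rejection rule above, the following sketch builds a one-dimensional toy. Everything specific in it is an assumption made for illustration and is not taken from the paper: the standard normal base model, the tail-only `reward`, the kernel density estimate standing in for the learned failure model $p_{\boldsymbol{\phi}}$, and the values of `tau` and `bandwidth`. It performs a single round with no distillation, and it evaluates success rates on a grid only to show the effect; in the setting described here, those reward calls would be the expensive part.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution; with defaults it plays the role of p_theta."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def reward(x):
    # Hypothetical sparse reward: success only far out in the tails.
    return 1.0 if abs(x) >= 3.5 else 0.0


def make_failure_model(failures, bandwidth=0.3):
    """Gaussian kernel density estimate used as a stand-in for p_phi (an assumption)."""
    def p_phi(x):
        return sum(normal_pdf(x, f, bandwidth) for f in failures) / len(failures)
    return p_phi


def in_rejection_region(x, p_theta, p_phi, tau):
    # x is rejected when p_theta(x) / p_phi(x) < tau.
    # Written as a product to avoid dividing by a vanishing p_phi(x).
    return p_theta(x) < tau * p_phi(x)


def success_rates(p_theta, p_phi, tau, lo=-6.0, hi=6.0, steps=1200):
    """Grid estimate of the success rate under the base model and under the
    base model conditioned on the complement of the rejection region."""
    dx = (hi - lo) / steps
    prior_mass = prior_hit = post_mass = post_hit = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        w = p_theta(x)
        prior_mass += w
        prior_hit += w * reward(x)
        if not in_rejection_region(x, p_theta, p_phi, tau):
            post_mass += w
            post_hit += w * reward(x)
    return prior_hit / prior_mass, post_hit / post_mass


# Draw from the base model and keep only the failed attempts.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(500)]
failures = [x for x in draws if reward(x) == 0.0]

p_phi = make_failure_model(failures)
prior_rate, post_rate = success_rates(normal_pdf, p_phi, tau=2.0)
print(f"base success rate:        {prior_rate:.2e}")
print(f"conditioned success rate: {post_rate:.2e}")
```

Because the kernel estimate places almost no density where no failure has been observed, the ratio stays below `tau` in the explored bulk and exceeds it in the unexplored tails, so conditioning shifts probability mass toward regions that have not yet been ruled out.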

Abstract

Today's generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two difficulties arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge from standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.

Paper Structure

This paper contains 42 sections, 9 equations, 10 figures, 4 tables, 2 algorithms.

Figures (10)

  • Figure 1: Illustration of BaNEL on a 1D toy example with negative-reward samples only. The procedure begins with a pre-trained proposal distribution (leftmost). Two reward-one samples (red bars) are located at -2 and 2. At each iteration, the proposal distribution generates samples, which are very likely to be reward-zero. These are used to train a negative model (red dashed curves). The proposal and negative models are combined to form the Bayesian posterior (black curves), following Eq. \ref{eq:posterior-definition}. As iterations progress, the posterior increasingly concentrates on the reward-one regions, until convergence (rightmost).
  • Figure 2: Prior samples (left, success rate: 8e-26) and the best posterior samples from our method (right, success rate: 5e-21).
  • Figure 3: Compute scaling: Improvement factor in success rate of BaNEL over the base model as a function of the number of epochs used to train $p_{\boldsymbol{\phi}}$ at each stage, averaged over 5 random seeds. The average success rates of RND and count-based methods are shown as horizontal reference lines.
  • Figure 4: (a) Adversarial attack setup for Sec. \ref{sec:exp-attack}; (b) examples of successful attacks found by BaNEL; (c) rule-based attack results using patterns in (b).
  • Figure 5: Cumulative best success rate of BaNEL and RND on GSM8K-Hard questions. Shaded area represents confidence intervals (Clopper-Pearson, $\alpha=0.05$, sample_size=10000).
  • ...and 5 more figures
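
The Clopper-Pearson interval cited in the Figure 5 caption can be computed directly from the binomial distribution. The sketch below uses only the standard library and finds both bounds by bisection. The function names are illustrative and do not come from the paper's code; in practice a statistics library would normally be used instead.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def _bisect_increasing(g, iters=100):
    """Root of an increasing function g on [0, 1] with g(0) < 0 < g(1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a success rate, given k successes in n trials."""
    half = alpha / 2.0
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - half)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: half - binom_cdf(k, n, p))
    return lower, upper


# Interval for zero observed successes at the sample size quoted in the caption.
print(clopper_pearson(0, 10000, alpha=0.05))
```

With zero observed successes the upper bound has the closed form $1-(\alpha/2)^{1/n}$, which for $n=10000$ and $\alpha=0.05$ is roughly $3.7\times10^{-4}$. Success rates well below that level therefore cannot be separated from zero with this sample size alone.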