Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

Carel van Niekerk, Renato Vukovic, Benjamin Matthias Ruppik, Hsien-chin Lin, Milica Gašić

TL;DR

The paper targets the miscalibration and reasoning gaps that persist in large language models after RLHF. It introduces Reinforcement Learning from Self-Feedback (RLSF), a post-training step that treats the model’s own confidence as an intrinsic reward. Chain-of-Thought decoding generates multiple reasoning traces, which are ranked by confidence; the resulting preferences train a reward model via a Bradley-Terry objective, and the policy is then optimized with PPO or DPO. Empirical results on mathematical reasoning and multiple-choice tasks show improved calibration (lower ECE) and higher accuracy, with reward-model performance competitive on RewardBench and with controlled bias behavior. The work argues that intrinsic rewards are viable for LLM post-training, notes the added computational cost of CoT decoding, and outlines limitations and safety considerations for broader deployment.
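The two quantities named above, the Bradley-Terry objective and ECE, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names, the equal-width binning, and the default of 10 bins are assumptions made here.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen trace beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(margin),
    so the loss is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (an assumed binning scheme).

    Returns the sample-weighted mean of |accuracy - mean confidence| per bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A reward model that scores the preferred trace well above the rejected one drives the loss toward zero, while equal scores give log 2. A model that reports 0.9 confidence but is right only 75% of the time has an ECE of 0.15.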

Abstract

Large Language Models (LLMs) often produce plausible but poorly calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model's probability estimates -- restoring well-behaved calibration -- and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model's own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrants further research into intrinsic rewards for LLM post-training.
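The ranking step described in the abstract can be illustrated with a small sketch. The abstract does not spell out the confidence definition, so the one used here is an assumption: the mean gap between the top-1 and top-2 token probabilities over the answer span, in the style of chain-of-thought decoding. The names `answer_confidence`, `build_preference_pairs`, and `min_gap`, and the `traces` data, are hypothetical.

```python
from itertools import combinations


def answer_confidence(answer_token_probs):
    """Assumed confidence: mean (top-1 minus top-2) probability over answer tokens."""
    gaps = [top1 - top2 for top1, top2 in answer_token_probs]
    return sum(gaps) / len(gaps)


def build_preference_pairs(traces, min_gap=0.0):
    """Rank traces by confidence and emit (chosen, rejected) pairs.

    No gold answers are used: the higher-confidence trace in each pair is
    treated as preferred. Pairs whose confidence gap is not above `min_gap`
    (a hypothetical filter) are skipped.
    """
    scored = sorted(
        ((answer_confidence(t["answer_token_probs"]), t["text"]) for t in traces),
        reverse=True,
    )
    pairs = []
    # combinations() over a descending list always yields (higher, lower).
    for (conf_hi, text_hi), (conf_lo, text_lo) in combinations(scored, 2):
        if conf_hi - conf_lo > min_gap:
            pairs.append({"chosen": text_hi, "rejected": text_lo})
    return pairs


# Made-up traces: each answer span is a list of (top-1, top-2) token probabilities.
traces = [
    {"text": "... so the answer is 12.", "answer_token_probs": [(0.95, 0.03)]},
    {"text": "... so the answer is 15.", "answer_token_probs": [(0.55, 0.40)]},
    {"text": "... the answer is 12.", "answer_token_probs": [(0.80, 0.10)]},
]
pairs = build_preference_pairs(traces)
```

The resulting `chosen`/`rejected` records have the shape that common preference-optimization tooling expects, which is what lets the later fine-tuning stage reuse a standard RLHF-style pipeline without human labels.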

Paper Structure

This paper contains 37 sections, 8 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: An overview of the RLSF pipeline.
  • Figure 2: Example responses from the RLSF fine-tuned Gemma 2 model. The response words are colored based on the rewards obtained from the RLSF reward model.
  • Figure 3: Example of calibration improvement via RLSF. Both responses incorrectly answer “A” instead of the correct answer “E”, but the post-RLSF model produces a better explanation and expresses lower confidence.