Maximizing Confidence Alone Improves Reasoning

Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, Deepak Pathak

TL;DR

This work introduces RENT, an unsupervised reinforcement learning framework that rewards language models for producing confident (low-entropy) outputs, enabling improved reasoning without ground-truth supervision. By optimizing a Group Relative Policy Optimization objective with a negative-entropy reward, RENT encourages high-confidence final reasoning steps, yielding consistent improvements on GSM8K, MATH500, AMC, AIME, and GPQA across Mistral, Llama, and Qwen families. The approach shows strong correlations between confidence and accuracy, outperforms format-based baselines and several concurrent intrinsic-reward methods, and highlights that targeting entropy in later tokens is most effective. The method offers a generalizable, low-supervision pathway to enhance long-form reasoning in real-world scenarios where labeled data is scarce or unavailable.
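The group-relative step mentioned above can be sketched as follows. This is a minimal illustration of the standard GRPO advantage normalization (each response's reward is centered and scaled by the statistics of its own sampling group), not the authors' implementation. The function name `group_relative_advantages`, the `eps` stabilizer, and the example reward values are assumptions made for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled responses.

    `rewards` holds one scalar per response to the same prompt; here each
    scalar is assumed to be a negative-entropy (confidence) reward.
    `eps` is an assumed stabilizer that avoids division by zero when all
    rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical confidence rewards for three responses to one prompt.
advantages = group_relative_advantages([-0.2, -1.0, -0.6])
```

Responses that are more confident than their group's average receive a positive advantage and are reinforced; less confident ones are pushed down. No ground-truth answer enters the computation.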

Abstract

Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT (Reinforcement Learning via Entropy Minimization), a fully unsupervised RL method that requires no external reward or ground-truth answers and instead uses the entropy of the model's own output distribution as an intrinsic reward. We find that reinforcing the chains of thought that yield high model confidence in the generated answers improves the model's reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and on models of varying sizes from the Qwen, Mistral, and Llama families. Because the method is unsupervised, it applies to a wide range of domains where external supervision is unavailable.
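As a rough sketch of what an entropy-based intrinsic reward can look like, the code below computes the Shannon entropy of each next-token distribution from raw logits and returns the negative mean entropy as the reward. Averaging over tokens, the function names, and the toy four-word vocabulary are assumptions for illustration; the source does not spell out these details here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    # Skip zero-probability entries, whose contribution is 0 in the limit.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def negative_entropy_reward(per_token_logits):
    """Confidence reward: negative mean entropy over the given tokens.

    `per_token_logits` is a list with one logit vector per generated token.
    Averaging over tokens is an assumption made for this sketch.
    """
    entropies = [token_entropy(logits) for logits in per_token_logits]
    return -sum(entropies) / len(entropies)

# Toy 4-word vocabulary: a peaked distribution versus a uniform one.
confident = negative_entropy_reward([[10.0, 0.0, 0.0, 0.0]] * 3)
uncertain = negative_entropy_reward([[0.0, 0.0, 0.0, 0.0]] * 3)
```

The reward is always at most zero and approaches zero as the distributions become more peaked, so maximizing it is the same as minimizing entropy.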

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of RENT: Reinforcement Learning via Entropy Minimization. For each response, we use the model's underlying confidence (negative entropy) as a reward for reinforcement learning. This enables the model to learn without any external reward or ground-truth answers.
  • Figure 2: Performance on GSM8K, MATH500, AMC, AIME, and GPQA. The standard deviations reported are over 5, 5, 32, 64, and 10 samples, respectively. Across benchmarks and models, we find that entropy minimization alone is an effective reward for improving the reasoning ability of language models. All models are Instruct models; the "Instruct" is omitted for brevity.
  • Figure 3: Accuracy and confidence over the course of training. The trends indicate that accuracy and confidence are indeed highly correlated and therefore it is natural to use confidence as a reward.
  • Figure 4: Evaluation (by computing correlation between accuracy and confidence) of various strategies for selecting which tokens to minimize the entropy over. We find the highest correlation between accuracy and confidence in the last few tokens of the response.
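The token-selection comparison described for Figure 4 can be sketched as follows: restrict the confidence score to the last k tokens of each response, then measure how well that score tracks accuracy with a Pearson correlation. The helper names, the choice of Pearson correlation, and the value of k are assumptions for illustration; they are not taken from the paper.

```python
import math

def last_k_confidence(token_entropies, k):
    """Negative mean entropy over the last k tokens of one response."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tail = token_entropies[-k:]  # shorter responses use all their tokens
    return -sum(tail) / len(tail)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def selection_strategy_score(all_token_entropies, accuracies, k):
    """Correlation between last-k confidence and per-response accuracy."""
    confidences = [last_k_confidence(e, k) for e in all_token_entropies]
    return pearson(confidences, accuracies)
```

Running `selection_strategy_score` for several values of k, or swapping `last_k_confidence` for a different selection rule, would give one correlation per strategy, which is the kind of quantity the figure compares.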