Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets

TL;DR

The paper addresses post-training alignment of large language models without expensive labeling. It introduces Reinforcement Learning via Self-Confidence (RLSC), which uses the model’s own confidence as the reward signal. The method formalizes a mode-sharpening objective and self-confidence losses, giving gradient-based training that needs no external rewards or labeled data. Experiments on Qwen2.5-Math-7B show consistent accuracy gains at the 7B scale across several math-reasoning benchmarks, along with emergent behaviors such as concise, confident answers without prompting. The approach is a practical, data-efficient post-training method suited to resource-constrained settings, and it draws a conceptual link between ensemble pseudo-labeling and self-supervised reinforcement learning.

Abstract

Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.
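
To make the idea of confidence as a reward concrete, the following is a minimal toy sketch, not the paper's training loop. It assumes that "self-confidence" is measured as the expected probability a distribution assigns to its own samples, F(p) = sum_y p(y)^2, and that the model is a single categorical distribution over three candidate answers parameterized by logits. The function names (`softmax`, `self_confidence`, `sharpen_step`) and the learning rate are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_confidence(probs):
    """Assumed confidence measure: F(p) = sum_y p(y)^2.

    This is the probability that two independent samples agree,
    so it grows as mass concentrates on one answer.
    """
    return sum(p * p for p in probs)


def sharpen_step(logits, lr=1.0):
    """One gradient-ascent step on F with respect to the logits.

    For a softmax, dF/dz_k = 2 * p_k * (p_k - F), so answers that are
    more likely than average are pushed up and the rest are pushed down.
    No labels or external reward appear anywhere in the update.
    """
    p = softmax(logits)
    f = self_confidence(p)
    grads = [2.0 * pk * (pk - f) for pk in p]
    return [z + lr * g for z, g in zip(logits, grads)]


# Toy run: three candidate answers, the first slightly preferred.
initial_logits = [1.0, 0.5, 0.0]
logits = list(initial_logits)
for _ in range(20):
    logits = sharpen_step(logits)

before = softmax(initial_logits)
after = softmax(logits)
```

In this toy setting the update only amplifies whatever answer the distribution already prefers, which is what "mode sharpening" suggests. A real language model cannot enumerate all completions, so the exact sum would presumably be replaced by an estimate over sampled completions, such as the 16 samples per question mentioned in the abstract.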

Paper Structure

This paper contains 13 sections, 26 equations, 1 figure, 2 tables, and 1 algorithm.

Figures (1)

  • Figure 1: Combined visualization: (a) RL via Self Confidence workflow schema; (b) Probability distribution before and after training.