
From the Inside Out: Progressive Distribution Refinement for Confidence Calibration

Xizhong Yang, Yinan Xia, Huiming Wang, Mofei Song

Abstract

Leveraging a model's internal information as a self-reward signal in Reinforcement Learning (RL) has received extensive attention because it requires no labels. While prior work has made significant progress in applying Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and training remains inadequately addressed. Moreover, Test-Time Training built on voting-based TTS strategies often suffers from reward hacking. To address these issues, we propose DistriTTRL, which leverages the distribution prior of the model's confidence during RL to progressively optimize the reward signal, rather than relying solely on single-query rollouts. Additionally, we mitigate the consistent reward hacking caused by voting-based TTS strategies through diversity-targeted penalties. Because model capability and self-reward signals reinforce each other under this training mechanism, and because reward hacking is mitigated, DistriTTRL achieves significant performance improvements across multiple models and benchmarks.

Paper Structure (27 sections, 23 equations, 6 figures, 2 tables, 4 algorithms)


Figures (6)

  • Figure 1: Overview of DistriTTRL. During the pseudo-label estimation process at each step, we first calibrate the Global Confidence Distribution from previous steps using the Local Confidence Distribution of all queries in the current step, then guide the distribution of specific queries in the current step with the calibrated distribution prior. Additionally, we compute diversity from each query's rollouts and adjust the advantage accordingly.
  • Figure 2: Confidence distribution shift across training steps. Trained on AMC using Qwen3-8B with 32 samples per question. Each distribution aggregates 15 consecutive steps.
  • Figure 3: Trend of majority ratio (left) and accuracy (right, average of 16) during the TTT process of training Qwen2.5-7B on AIME2024. GT denotes direct supervision with ground truth, while Voting denotes using Majority Voting for pseudo-label estimation.
  • Figure 4: Effect of diversity-targeted penalty in DistriTTRL on mitigating consistency reward hacking, training Qwen2.5-7B on AIME2024.
  • Figure 5: Impact of voting budget on the accuracy of different TTS strategies, evaluated on AIME2024 using DeepSeek-R1-8B (64 repeats). Complete results are provided in the appendix on budget scaling.
  • ...and 1 more figures
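
The Voting baseline and the majority ratio named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes each rollout has already been reduced to a final answer string; the function names `majority_vote_rewards` and `diversity` are hypothetical. The `diversity` score (fraction of distinct answers) is only an illustrative stand-in and is not DistriTTRL's diversity-targeted penalty or its confidence-distribution calibration, neither of which is specified on this page.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Estimate a pseudo-label for one query by majority voting.

    Returns (pseudo_label, majority_ratio, rewards), where rewards[i] is 1.0
    if rollout i agrees with the pseudo-label and 0.0 otherwise.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    counts = Counter(answers)
    # most_common(1) returns the answer with the highest count.
    pseudo_label, top_count = counts.most_common(1)[0]
    majority_ratio = top_count / len(answers)
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, majority_ratio, rewards


def diversity(answers):
    """Hypothetical diversity score: fraction of distinct answers (assumption)."""
    return len(set(answers)) / len(answers)


# Eight illustrative rollouts for a single query (made-up answers).
answers = ["42", "42", "17", "42", "8", "42", "17", "42"]
label, ratio, rewards = majority_vote_rewards(answers)
print(label, ratio, diversity(answers))
```

A majority ratio that climbs toward 1.0 while accuracy stalls would mean the rollouts are collapsing onto one answer regardless of correctness, which is the kind of reward hacking a diversity-based adjustment is meant to counteract.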