Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng

TL;DR

This paper addresses how to control sampling temperature during RL-based training of LLMs. It introduces IntroLLM, a hierarchical RL framework that learns two policies: a temperature policy conditioned on the model's internal hidden states, and a token policy. Both are trained jointly via GRPO with a shared verifiable reward. The reported results show that reward-driven, introspective temperature control improves reasoning performance and diversity on mathematical benchmarks over fixed and heuristic baselines, generalizes out of domain, and adds minimal computational overhead. The approach reframes decoding-time decisions as learnable RL components, suggesting applicability to generation-time controls beyond temperature.
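To make "GRPO with a shared verifiable reward" concrete, the sketch below computes the standard group-relative advantage used by GRPO: rewards for several sampled responses to the same prompt are normalized by the group mean and standard deviation. The function name and the example rewards are illustrative assumptions, and using the same advantage for both the token and temperature policies is one plausible reading of "shared", not a confirmed detail of the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampling group.

    `rewards` holds one scalar verifiable reward per sampled response
    to the same prompt (e.g. 1.0 if the final answer is correct, else 0.0).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of four responses: two correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)  # approximately [1, -1, -1, 1]

# Assumption: a shared reward means the same per-response advantage would
# weight the log-probabilities of both the sampled tokens and the sampled
# temperatures in that response.
```

If every response in a group receives the same reward, the advantages are all zero and that group contributes no gradient signal.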

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
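The per-step procedure described in the abstract (choose a temperature from the hidden state, then sample the next token from the temperature-scaled distribution) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear-plus-sigmoid temperature head, its parameters `w` and `b`, and the range `[tau_min, tau_max]` are all assumptions, and the head is deterministic here for brevity, whereas a temperature policy trained with RL would presumably sample its temperature.

```python
import numpy as np

def select_temperature(h_t, w, b, tau_min=0.3, tau_max=1.5):
    """Hypothetical temperature head: map hidden state h_t to a temperature.

    A linear projection followed by a sigmoid, rescaled to
    [tau_min, tau_max]. All parameters are illustrative assumptions.
    """
    z = float(h_t @ w + b)
    gate = 1.0 / (1.0 + np.exp(-z))
    return tau_min + (tau_max - tau_min) * gate

def temperature_softmax(logits, tau):
    """Softmax of logits / tau, computed in a numerically stable way."""
    scaled = logits / tau
    scaled = scaled - scaled.max()
    probs = np.exp(scaled)
    return probs / probs.sum()

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

# One decoding step with random stand-ins for the model's outputs.
rng = np.random.default_rng(0)
h_t = rng.normal(size=16)        # stand-in for the hidden state
w = 0.1 * rng.normal(size=16)    # hypothetical head weights
logits = rng.normal(size=50)     # stand-in for next-token logits

tau_t = select_temperature(h_t, w, b=0.0)
probs = temperature_softmax(logits, tau_t)
y_t = int(rng.choice(len(probs), p=probs))  # sampled next token
```

Comparing `entropy(temperature_softmax(logits, 0.3))` with `entropy(temperature_softmax(logits, 1.5))` illustrates the abstract's point that temperature modulates policy entropy: for non-uniform logits, a higher temperature flattens the distribution and raises its entropy.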

Paper Structure

This paper contains 32 sections, 15 equations, 7 figures, 6 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Overview of the IntroLLM framework. At each decoding step, a temperature policy $\pi_\phi$ observes the hidden state $h_t$ and selects a sampling temperature $\tau_t$, which then conditions the token policy $\pi_\theta$ to generate the next token $y_t$. Both policies are jointly optimized via reinforcement learning from verifiable task rewards.
  • Figure 2: Distribution of predicted temperatures across MATH-500 difficulty levels (L1–L5, from easy to hard). As the problems get harder, the average temperature tends to increase.
  • Figure 3: Learned temperature patterns. Gray dashed line: global average across MATH-500 showing natural annealing. Orange line: individual problem showing "reasoning rhythm" with peaks at logical pivots and valleys during computation.
  • Figure 4: Reasoning keywords trigger high temperatures. Word cloud of the 100 highest-temperature tokens across MATH-500. Reasoning keywords like "assume", "consider", and "finding" consistently receive increased exploration.
  • Figure 5: Temperature intensity evolution during training. (Left) Global mean and extrema trajectories. (Right) Mean values across difficulty levels. The policy follows an emergent, non-monotonic cycle of exploration, exploitation, and diversity preservation, distinct from traditional annealing schedules.
  • ...and 2 more figures