Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng
TL;DR
This paper addresses the challenge of controlling sampling temperature during RL-based training of LLMs. It introduces Introspective LLM (IntroLLM), a hierarchical RL framework that learns two policies: a temperature policy conditioned on the model's internal hidden states, and a token policy. Both are trained jointly via GRPO with a shared verifiable reward. The results show that reward-driven, introspective temperature control improves reasoning performance and diversity on mathematical benchmarks over fixed and heuristic temperature baselines, with robust out-of-domain generalization and minimal computational overhead. The approach reframes decoding-time decisions as learnable RL components, suggesting broader applicability to generation-time controls beyond temperature.
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
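The two-level decoding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear temperature head, the discrete set of candidate temperatures, and all names (`select_temperature`, `decode_step`, `head_weights`, `candidate_temps`) are assumptions made for illustration only. The sketch shows a high-level policy picking a temperature from the hidden state, followed by a low-level policy sampling a token from the temperature-scaled distribution.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_temperature(hidden_state, head_weights, candidate_temps, rng):
    """Hypothetical temperature policy (an assumption, not from the paper):
    a linear head scores each candidate temperature from the hidden state,
    and one candidate is sampled from the softmax over those scores."""
    scores = [sum(w * h for w, h in zip(row, hidden_state)) for row in head_weights]
    probs = softmax_with_temperature(scores, 1.0)
    index = rng.choices(range(len(candidate_temps)), weights=probs, k=1)[0]
    return index, candidate_temps[index]


def decode_step(hidden_state, token_logits, head_weights, candidate_temps, rng):
    """One hierarchical step: choose a temperature, then sample a token."""
    temp_index, temperature = select_temperature(
        hidden_state, head_weights, candidate_temps, rng
    )
    token_probs = softmax_with_temperature(token_logits, temperature)
    token = rng.choices(range(len(token_logits)), weights=token_probs, k=1)[0]
    # Both choices would be recorded so a shared reward can update both policies.
    return temp_index, temperature, token
```

For fixed logits, a higher temperature flattens the token distribution and raises its entropy, which is the sense in which temperature modulates exploration. Recording `temp_index` together with `token` is one way to let a trajectory-level reward reach both the temperature and token policies; how the paper actually parameterizes and updates them is not specified in this section.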
