Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash

TL;DR

The paper investigates whether reinforcement learning for language models yields truly novel behaviors or primarily sharpens existing ones, and proposes representation-based exploration via elliptical bonuses derived from pre-trained hidden states. It introduces RepExp, a simple, scalable method used for both inference-time selection and RL post-training, and reports significant improvements in verifier efficiency and pass@k across diverse models and tasks, including hard math datasets. At inference time, RepExp reduces the number of samples needed to find a correct answer; in RL post-training, it mitigates diversity collapse. Extensions to token-level generation and RL post-training show further gains, especially on harder questions. The results support the claim that deliberate, representation-guided exploration is a practical path to discovering new reasoning behaviors beyond mere sharpening, with implications for scalable reasoning in open-domain language tasks.

Abstract

Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates -- both for post-training, and in a novel inference-time scaling setting we introduce. For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on almost all tasks. For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration -- with the right notion of diversity -- is a practical path toward discovery of new behaviors beyond sharpening.
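
The abstract reports results as pass@k and as a sample-efficiency ratio. The excerpt does not say how pass@k is estimated, so the sketch below assumes the commonly used unbiased combinatorial estimator: with `n` samples per question of which `c` are correct, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the paper's exact evaluation protocol is not given in this excerpt.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Sample-efficiency ratio quoted in the abstract: pass@80 of one method
# matching pass@256 of another corresponds to 256 / 80 samples.
efficiency_ratio = 256 / 80
```

Under this reading, the "3x" figure in the abstract is the ratio 256 / 80 = 3.2, which agrees with the lower end of the 3.2-13.4x range quoted for GRPO in the Figure 2 caption.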

Paper Structure

This paper contains 58 sections, 1 theorem, 11 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Given the current generation step $i$, the non-mean-centered inverse data covariance matrix $\tilde{\Sigma}^{-1}_{(i)}$ and the current mean $\mu^{(i)}$, the mean-centered inverse data covariance matrix $\Sigma^{-1}_{(i)}$ is given as [equation omitted in the source]. Here, $T_j$ indicates the length (in number of tokens) of response $j$.
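
The closed-form expression of Proposition 1 is not reproduced in this summary. As a hedged illustration of how such a correction can work, the sketch below assumes the uncentered matrix is $\tilde{\Sigma} = \lambda I + \sum_t h_t h_t^\top$ and the centered one is $\Sigma = \lambda I + \sum_t (h_t - \mu)(h_t - \mu)^\top = \tilde{\Sigma} - n \mu \mu^\top$, where $n$ is the number of feature vectors (for per-token features this would be $\sum_j T_j$, which is our reading rather than a statement from the excerpt). The Sherman-Morrison identity then recovers $\Sigma^{-1}$ from $\tilde{\Sigma}^{-1}$ without a fresh matrix inversion. The names `centered_inverse`, `lam`, and `H` are hypothetical.

```python
import numpy as np


def centered_inverse(inv_uncentered: np.ndarray, mu: np.ndarray, n: int) -> np.ndarray:
    """Return (A - n * mu mu^T)^{-1} given A^{-1}, using Sherman-Morrison.

    Assumed identity (not copied from the paper):
        (A - n mu mu^T)^{-1}
            = A^{-1} + n (A^{-1} mu)(A^{-1} mu)^T / (1 - n mu^T A^{-1} mu)
    which holds for symmetric A^{-1}.
    """
    v = inv_uncentered @ mu
    denom = 1.0 - n * float(mu @ v)
    return inv_uncentered + n * np.outer(v, v) / denom


# Numerical check on synthetic features (one row per feature vector).
rng = np.random.default_rng(0)
lam = 1.0                      # hypothetical ridge regularizer
H = rng.normal(size=(50, 4))   # hypothetical hidden-state features
n = H.shape[0]
mu = H.mean(axis=0)

uncentered = lam * np.eye(4) + H.T @ H
centered = lam * np.eye(4) + (H - mu).T @ (H - mu)

fast = centered_inverse(np.linalg.inv(uncentered), mu, n)
direct = np.linalg.inv(centered)
```

The denominator stays positive whenever $\lambda > 0$, because the centered matrix is positive definite and its determinant equals $\det(\tilde{\Sigma})\,(1 - n \mu^\top \tilde{\Sigma}^{-1} \mu)$.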

Figures (9)

  • Figure 1: Representation-based inference-time exploration improves verifier efficiency. (Left) We plot the samples-to-correct, the average number of samples until a correct response is selected, for a wide range of tasks and models. We compare two inference-time exploration methods: representation-based exploration (see the methods section) and naive (random) sampling from the base model. (Right) We display samples-to-correct, disaggregated to each question in the dataset, for two model-task pairs. We find representation-based exploration improves over random sampling for most model-task pairs. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on GSM8K, MATH, MBPP+, and Game-of-24. See the coresets section for details.
  • Figure 2: Pass@k for RL post-training with exploration. We find that RL generally increases the pass@k for small values of $k$ compared to the base model, but that exploration is required to improve or even preserve base model pass rates for larger values of $k$. For MATH and GSM8K, RepExp roughly matches or improves upon Unlikeliness for $k \ge 2$. For AIME 2024, RepExp is slightly worse than Unlikeliness until $k = 64$, after which it surpasses Unlikeliness for all larger values of $k$. Shaded areas indicate one standard error. Horizontal arrows indicate the test-time sample efficiency improvement for pass@256 of RepExp over GRPO (blue) and Unlikeliness (orange). RepExp is 2.1-4.1x more sample-efficient than Unlikeliness and 3.2-13.4x more sample-efficient than GRPO.
  • Figure 3: RepExp for inference-time exploration. Given a prompt, RepExp selects a diverse set of responses from a large pool by optimizing elliptical bonuses computed using representations from the language model.
  • Figure 4: RepExp
  • Figure 5: A closer look into when RepExp provides improvement. (Left) For each task, we rank models according to their pass@1 rate (the weakest model has rank 0, and the strongest has rank 8). We then plot relative improvement (%) of RepExp over random sampling, sorting by rank on the x-axis. While RepExp can hurt weaker models (e.g., Qwen-2.5-0.5B-Instruct), we find stronger models almost always benefit from exploration (e.g., Qwen-2.5-14B-Instruct). (Right) For two different model-task pairs, we plot the samples-to-correct as a function of question hardness. Hardness is measured by the samples-to-correct from a high-quality third-party model (GPT-4o mini). We find that RepExp has the greatest benefit on harder examples (e.g., the hardest 20% of questions on MATH). Shaded areas indicate one standard error.
  • ...and 4 more figures
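
The Figure 3 caption describes selecting a diverse set of responses from a pool by optimizing elliptical bonuses over language-model representations. A minimal sketch of one way to do this is given below. It assumes each response is summarized by one feature vector $h$, that the bonus is $h^\top \Sigma^{-1} h$ with $\Sigma = \lambda I + \sum_{\text{selected}} h h^\top$, and that selection is greedy. The function name `select_diverse`, the parameter `lam`, and the greedy rule are illustrative assumptions, not the paper's confirmed algorithm.

```python
import numpy as np


def select_diverse(features, k: int, lam: float = 1.0) -> list:
    """Greedily pick k rows of `features` with the largest elliptical bonus.

    Assumed bonus: h^T Sigma^{-1} h, with Sigma = lam * I + sum of h h^T over
    the rows already selected. Sigma^{-1} is maintained with rank-one
    Sherman-Morrison updates instead of re-inverting at every step.
    """
    features = np.asarray(features, dtype=float)
    n, d = features.shape
    inv_cov = np.eye(d) / lam
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(min(k, n)):
        # Quadratic form h_i^T inv_cov h_i for every candidate i.
        bonus = np.einsum("ij,jk,ik->i", features, inv_cov, features)
        bonus[~available] = -np.inf  # never re-select a response
        i = int(np.argmax(bonus))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison update for Sigma <- Sigma + h h^T.
        h = features[i]
        v = inv_cov @ h
        inv_cov = inv_cov - np.outer(v, v) / (1.0 + float(h @ v))
    return chosen


# Toy pool: two identical responses and one orthogonal response.
pool = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
picked = select_diverse(pool, k=2)
```

In the toy pool, once the first response is selected its duplicate's bonus drops, so the second pick is the response pointing in a new direction. This is the sense in which the bonus rewards novelty in representation space.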

Theorems & Definitions (2)

  • Remark 1
  • Proposition 1