Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Jens Tuyls; Dylan J. Foster; Akshay Krishnamurthy; Jordan T. Ash

TL;DR

The paper investigates whether reinforcement learning for language models yields genuinely novel behaviors or primarily sharpens existing ones, and proposes representation-based exploration via elliptical bonuses derived from the pre-trained model's hidden states. It introduces RepExp, a simple, scalable method used for both inference-time selection and RL post-training, and reports significant improvements in verifier efficiency and pass@k across diverse models and tasks, including hard math datasets. At inference time, RepExp reduces the number of samples needed to find correct answers; in RL post-training, it mitigates diversity collapse. Extensions to token-level generation and to RL post-training show further gains, especially on harder questions. The results support the claim that deliberate, representation-guided exploration is a practical path to discovering new reasoning behaviors beyond mere sharpening, with implications for scalable reasoning in open-domain language tasks.
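
To make the idea of an elliptical bonus concrete, here is a minimal sketch of greedy, diversity-seeking selection over candidate responses. It is not taken from the paper. It assumes each candidate is summarized by a single feature vector (for instance a pooled hidden state), scores novelty as sqrt(h^T Sigma^{-1} h), and updates Sigma^{-1} with a rank-one Sherman-Morrison step after each pick. The function names, the regularizer `lam`, and the toy feature matrix are all hypothetical.

```python
import numpy as np

def elliptical_bonus(h, cov_inv):
    """Elliptical novelty score sqrt(h^T Sigma^{-1} h) for one feature vector."""
    return float(np.sqrt(h @ cov_inv @ h))

def select_diverse(features, k, lam=1.0):
    """Greedily pick up to k rows of `features` with the largest bonus.

    features: (n, d) array, one (hypothetical) hidden-state vector per candidate.
    lam: ridge regularizer for the covariance (an assumed hyperparameter).
    Returns the chosen row indices in selection order.
    """
    n, d = features.shape
    cov_inv = np.eye(d) / lam  # inverse of the initial Sigma = lam * I
    chosen = []
    remaining = list(range(n))
    for _ in range(min(k, n)):
        bonuses = [elliptical_bonus(features[i], cov_inv) for i in remaining]
        best = remaining[int(np.argmax(bonuses))]
        chosen.append(best)
        remaining.remove(best)
        # Sherman-Morrison: inverse of (Sigma + h h^T) from the inverse of Sigma.
        h = features[best]
        u = cov_inv @ h
        cov_inv = cov_inv - np.outer(u, u) / (1.0 + h @ u)
    return chosen

# Toy example: two identical candidates and one orthogonal candidate.
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
picked = select_diverse(toy, k=2)
```

In the toy example, once the first candidate is chosen its duplicate's bonus shrinks, so the orthogonal candidate is picked next. This is the sense in which an elliptical bonus rewards directions of representation space that have not yet been covered.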

Abstract

Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates -- both for post-training, and in a novel inference-time scaling setting we introduce. For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on almost all tasks. For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration -- with the right notion of diversity -- is a practical path toward discovery of new behaviors beyond sharpening.
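
For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k)/C(n, k), computed from n sampled responses of which c are correct. This is an illustration only. The abstract does not say how pass@k is computed in this work, and the function name and arguments are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, c of which are correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Sample-count ratio behind the abstract's AIME 2024 comparison.
sample_ratio = 256 / 80
```

The "3x" figure in the abstract corresponds to this ratio: matching pass@256 with only 80 samples uses 256 / 80 = 3.2 times fewer samples.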
