Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, David Brandfonbrener
TL;DR
The paper presents Q-Probe, a lightweight approach to reward maximization for language models that freezes the base model and trains a small linear probe on its embeddings to reweight candidate completions. At inference, it draws $k$ samples from the base LM and selects among them with a softmax over the probe values $Q_\theta$; as $k$ grows, this procedure is theoretically equivalent to KL-constrained maximization of the probe. The probe can be trained by reward modeling or by direct policy learning with importance-weighted policy gradients, and the approach extends to learning from human preferences. Results show gains on benchmarks with ground-truth rewards (MBPP for code generation, GSM-8K for math word problems) and when Q-Probe is combined with other methods, including on API-based models. The method is particularly appealing in data- and compute-constrained settings, offering a practical middle ground between prompting and full finetuning, with potential applicability across tasks and modalities. Overall, Q-Probe demonstrates that a small linear discriminator operating on embeddings can improve task-specific rewards with limited training and flexible deployment.
Abstract
We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few-shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candidate completions. We theoretically show that this sampling procedure is equivalent to a KL-constrained maximization of the Q-probe as the number of samples increases. To train the Q-probes, we consider either reward modeling or a class of novel direct policy learning objectives based on importance-weighted policy gradients. With this technique, we see gains in domains with ground-truth rewards (code generation) as well as implicit rewards defined by preference data, even outperforming finetuning in data-limited regimes. Moreover, a Q-probe can be trained on top of an API since it only assumes access to sampling and embeddings. Code: https://github.com/likenneth/q_probe.
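The reweighting step described above can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the temperature parameter `beta`, and the toy embeddings are assumptions made for illustration. In practice the embeddings would come from the frozen base model, one per sampled completion. The sketch scores each of the $k$ candidates with a linear probe, $Q_\theta = \theta^\top \phi$, and samples one candidate from a softmax over $Q_\theta / \beta$.

```python
import math
import random


def probe_scores(embeddings, theta):
    """Linear probe: Q_theta = <theta, phi> for each candidate embedding phi."""
    return [sum(w * x for w, x in zip(theta, phi)) for phi in embeddings]


def softmax(scores, beta):
    """Numerically stable softmax over scores / beta (beta is an assumed temperature)."""
    top = max(scores)
    exps = [math.exp((s - top) / beta) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def q_probe_select(embeddings, theta, beta=0.1, rng=None):
    """Sample the index of one of k candidates, weighted by softmax(Q_theta / beta).

    embeddings: list of k embedding vectors, one per candidate completion
                (hypothetical stand-ins for the base model's embeddings).
    theta:      probe weights, same length as each embedding.
    Returns (chosen_index, selection_probabilities).
    """
    rng = rng or random.Random()
    probs = softmax(probe_scores(embeddings, theta), beta)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs


if __name__ == "__main__":
    # Toy data: three candidates with 2-dimensional embeddings.
    candidates = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    theta = [1.0, 0.0]
    idx, probs = q_probe_select(candidates, theta, beta=1.0, rng=random.Random(0))
    print(idx, probs)
```

A small `beta` makes the selection close to picking the highest-scoring candidate, while a large `beta` leaves the base model's samples nearly unweighted. Only sampling and embedding access are needed, which is why a probe of this kind can sit on top of an API.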
