Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, Ekdeep Singh Lubana
TL;DR
This paper introduces RLFR, a framework that turns internal model features, read out via probes, into scalable reward signals for training open-ended behaviors, with a focus on reducing hallucinations in large language models. The approach localizes potentially false claims, intervenes by maintaining, retracting, or correcting them, and uses a probe-based reward pipeline to drive reinforcement learning in a data-efficient way. Instantiated on Gemma-3-12B-IT with Longfact++, RLFR yields a policy that is 58% less likely to hallucinate than the original model and enables scalable test-time compute via Best-of-N sampling, while preserving standard benchmark performance. The work proposes a novel paradigm in interpretability in which features serve as oversight signals for learning robust, open-ended behaviors, with potential to extend to other desirable model capabilities.
Abstract
Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider hallucination reduction as a desirable yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process, operationalized on Gemma-3-12B-IT, results in a policy that is 58% less likely to hallucinate than the original model, while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
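To make the idea of a feature serving as a reward concrete, the following is a minimal sketch and not the paper's implementation. It assumes a logistic linear probe over per-claim activation vectors, a reward defined as the negative mean probe score, and Best-of-N selection by that reward. The function names (`probe_score`, `feature_reward`, `best_of_n`), the probe form, and the aggregation rule are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical types: one activation vector per claim, one list of claims per completion.
Activation = Sequence[float]
Completion = List[Activation]


def probe_score(activation: Activation, weights: Sequence[float], bias: float) -> float:
    """Logistic linear probe: a score in (0, 1), read here as 'likely hallucinated'."""
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def feature_reward(claims: Completion, weights: Sequence[float], bias: float) -> float:
    """Assumed reward: negative mean probe score over the claims in a completion."""
    if not claims:
        return 0.0  # assumption: a completion with no claims is neutral
    scores = [probe_score(c, weights, bias) for c in claims]
    return -sum(scores) / len(scores)


def best_of_n(candidates: List[Completion],
              reward_fn: Callable[[Completion], float]) -> Completion:
    """Return the candidate completion with the highest reward."""
    return max(candidates, key=reward_fn)


if __name__ == "__main__":
    weights, bias = [1.0, -1.0], 0.0            # toy probe parameters
    risky = [[2.0, 0.0]]                        # probe fires strongly
    safe = [[-2.0, 0.0]]                        # probe stays low
    chosen = best_of_n([risky, safe], lambda c: feature_reward(c, weights, bias))
    print(chosen is safe)
```

In this sketch the same scalar could in principle be used both as an RL reward and as a test-time selection criterion, which mirrors the dual use described above; how the actual pipeline scores and aggregates claims is not specified in this section.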
