Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Aaditya Vikram Prasad; Connor Watts; Jack Merullo; Dhruvil Gala; Owen Lewis; Thomas McGrath; Ekdeep Singh Lubana

TL;DR

This paper introduces RLFR, a framework that turns internal model features, measured via probes, into scalable reward signals for training open-ended behaviors, with a focus on reducing hallucinations in large language models. The approach localizes potentially false claims, has the model intervene by maintaining, retracting, or correcting them, and uses a probe-based reward pipeline to drive reinforcement learning in a data-efficient way. Instantiated on Gemma-3-12B-IT with Longfact++, RLFR reduces hallucinations by 58% when combined with best-of-32 sampling of interventions at test time, while preserving standard benchmark performance. The work proposes a paradigm in interpretability in which features serve as oversight signals for learning open-ended task behaviors, with potential to extend to other desirable model capabilities.

Abstract

Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model, while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
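
To make the idea of "features as reward functions" concrete, here is a minimal sketch under stated assumptions. It treats a reward probe as a logistic regression over mean-pooled activations and uses its output both as a scalar reward and as the selection score for best-of-n sampling. The pooling scheme, the linear probe form, and the names `probe_reward` and `best_of_n` are illustrative assumptions, not the paper's implementation; the random arrays stand in for real model activations.

```python
import numpy as np


def probe_reward(activations: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """Score one sampled intervention with a (hypothetical) linear probe.

    activations: (num_tokens, d_model) activations read from the intervention text.
    Returns a probability in (0, 1) that is used directly as the reward.
    """
    pooled = activations.mean(axis=0)            # assumption: mean-pool over tokens
    logit = float(pooled @ weights + bias)       # linear readout of the feature
    return float(1.0 / (1.0 + np.exp(-logit)))   # logistic link -> probability


def best_of_n(candidates, weights, bias):
    """Return the index of the highest-reward candidate and all rewards."""
    rewards = [probe_reward(a, weights, bias) for a in candidates]
    return int(np.argmax(rewards)), rewards


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 16
    weights = rng.normal(size=d_model)           # stand-in for trained probe weights
    # Four sampled interventions of different lengths (placeholder activations).
    candidates = [rng.normal(size=(n_tok, d_model)) for n_tok in (5, 8, 3, 6)]
    best, rewards = best_of_n(candidates, weights, bias=0.0)
    print("rewards:", [round(r, 3) for r in rewards], "selected:", best)
```

During training the same scalar would be passed to the RL update as the reward for each sampled intervention; at test time only the selection step is needed.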

Paper Structure (106 sections, 13 equations, 11 figures, 18 tables, 2 algorithms)

Introduction
This work.
Background
Interpretability, Features, and Control.
Learning Open-Ended Behaviors.
Feature Rewards to Mitigate Hallucinations
Localize and Classify Candidate Hallucinations from Overall Text
Intervention
Reward
RL
Instantiation for Hallucinations.
Results
Evaluating the Overall Method
Evaluating Individual Components
Probes.
...and 91 more sections

Figures (11)

Figure 1: Features as Rewards. While verifiable tasks are relatively straightforward to optimize, open-ended tasks, if they permit any form of reward signal at all, typically require using LLMs as judges, which can be slow and poorly-calibrated to the underlying task. However, even open-ended behaviors are often represented in LLM features, which we can measure using interpretability techniques such as probes. These features have the added benefit of being well calibrated to model beliefs. Optimizing against these features is then possible, and enables scalable RL training for open-ended tasks.
Figure 2: Framework. Our end-to-end framework incorporates both a novel hallucination-monitoring pipeline and an intervention-and-reward pipeline. First, localization and classification probes detect possible hallucinations as spans in input text. The student policy is then asked in a new context to intervene on its potential mistake. Sampled interventions are graded by the reward pipeline, which is run (at train time) on the base model's activations, not the student's activations. RL then updates the student's weights. At test time, we instead select the best of our n sampled interventions and either inject it into the main context (which we refer to as an "inline intervention") or simply save it for later viewing (referred to as a "notinline" intervention). When run end-to-end, our framework produces a policy that is both less hallucinatory by default and able to correct its own mistakes when prompted by our monitoring pipeline.
Figure 3: Probes. Probing is done in two pipelines: Monitoring and Reward. These probes are critical to the health of our entire pipeline, so we ensure their efficacy in three ways. First, we ensure that each of them has high AUC-ROC, which is our main metric for selecting probes. Then, we ensure the probes are well-calibrated, meaning that a probe prediction of 0.XY corresponds closely to an XY% chance of the positive class. Finally, we plot probe predictions across sample text and check for inconsistencies. (a) The monitoring pipeline is parameterized by two probes; the former is used for localization and the latter is used for classification. The localization probe predicts at each token whether it is in an Entity with the previous token, where an Entity is a single claim that is to be tested for hallucinations. The classification probe uses activations from across a localized Entity to predict the probability it was hallucinated. Entities that trigger the classification probe are intervened upon in a separate context and then rewarded. (b) The reward pipeline is similarly parameterized by two probes, which grade two different types of interventions upon hallucinated entities. The former probe grades retractions, while the latter grades corrections. These are run on activations from the separate (intervention) context, and each predicts the probability that a given intervention has properly resolved its entity.
Figure 4: End-to-End Results. We find a topline hallucination reduction rate of 58% for our method, RLFR, with best-of-32 sampling. We decompose this overall reduction into three component parts. 10% of the reduction comes from the policy itself becoming less hallucinatory throughout training, 35% of the reduction comes from placing interventions into the completion they are correcting, and 13% of the reduction comes from interventions (on net) resolving hallucinations. Removing best-of-n sampling (RLFR) decreases efficacy slightly, mostly through a drop in direct reduction as intervention quality diminishes. Removing the inline interventions (RLFR-NI) removes any in-context reduction, but still maintains a 31% overall reduction. This is comparable to using the base model with our monitoring pipeline and inlined interventions (Base + Monitor), showcasing the power of targeted ICL.
Figure 5: Attention Maps for Reward Pipeline. The reward pipeline reads in activations across interventions to predict the probability the hallucinated claim was either Fixed or Retracted. We use attention heatmaps to coarsely assess what information the reward probes focus on, finding high attention on entity and relation tokens (top 3 rows). Interestingly, for a Failed Fix, which should ideally receive low reward, we see higher attention on punctuation tokens. Similar results are seen for retraction.
...and 6 more figures
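
The Figure 3 caption names two numerical checks for every probe: high AUC-ROC and good calibration. The sketch below shows one way such checks could be computed with NumPy alone. The pairwise AUC formula is standard; expected calibration error (ECE) with equal-width bins is a common calibration summary that is assumed here for illustration, since the source does not say which calibration metric or binning it uses. The function names and the toy labels and scores are hypothetical.

```python
import numpy as np


def auc_roc(labels, scores) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    diff = pos[:, None] - neg[None, :]           # all positive/negative pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def expected_calibration_error(labels, scores, n_bins: int = 10) -> float:
    """Weighted mean gap between predicted probability and observed frequency.

    Scores are assumed to be probabilities in [0, 1], split into equal-width bins.
    """
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(labels[mask].mean() - scores[mask].mean())
            ece += mask.mean() * gap             # weight by fraction of samples in bin
    return float(ece)


if __name__ == "__main__":
    # Toy data: 1 = hallucinated entity, scores = probe probabilities.
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print("AUC-ROC:", auc_roc(y, p))
    print("ECE:", expected_calibration_error(y, p, n_bins=5))
```

A probe with high AUC-ROC but large calibration error ranks claims well yet reports probabilities that cannot be read at face value, which is why the caption treats the two checks separately.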
