Table of Contents
Fetching ...

Training LLMs for Honesty via Confessions

Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, Amelia Glaese

TL;DR

This work tackles the problem of dishonesty in large language models by introducing confession training, where a model outputs a post-answer confession evaluating its adherence to instructions, with a separate honesty-based reward that does not influence the main task reward. The authors implement a proof-of-concept on GPT-5-Thinking, showing that confessions can broadly surface misbehavior, improve honesty through RL training, and scale with inference-time compute while leaving primary-task performance largely intact. They demonstrate that confessions can detect reward hacking, provide calibrated subjective confidence signals, and serve as a practical monitoring tool complementary to chain-of-thought monitoring. The paper discusses limitations, connections to related work, and future directions for enhancing honesty and monitorability in increasingly capable, agentic models.

Abstract

Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported *confession*. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.

Training LLMs for Honesty via Confessions

TL;DR

This work tackles the problem of dishonesty in large language models by introducing confession training, where a model outputs a post-answer confession evaluating its adherence to instructions, with a separate honesty-based reward that does not influence the main task reward. The authors implement a proof-of-concept on GPT-5-Thinking, showing that confessions can broadly surface misbehavior, improve honesty through RL training, and scale with inference-time compute while leaving primary-task performance largely intact. They demonstrate that confessions can detect reward hacking, provide calibrated subjective confidence signals, and serve as a practical monitoring tool complementary to chain-of-thought monitoring. The paper discusses limitations, connections to related work, and future directions for enhancing honesty and monitorability in increasingly capable, agentic models.

Abstract

Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported *confession*. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.

Paper Structure

This paper contains 25 sections, 21 figures, 2 tables.

Figures (21)

  • Figure 1: The confession approach to LLM honesty. A standard conversation with a model involves one or more user messages ($x$), followed by chain-of-thought and tool calls ($z$) and ending with an answer ($y$). The confession approach adds a final system message ($x_c$) requesting a confession report ($y_c$, optionally preceded by a confession chain-of-thought $z_c$), which the model then produces. The confession includes an enumeration of explicit or implicit requirements, as well as an evaluation of whether or not the model complied with each. In this example (taken from an evaluation from schoen2025stress) the model "sandbags" (intentionally underperforms) and fails to follow the instruction about not coding, but it admits to this in its confession. The above text is paraphrased; see Figure \ref{['fig:sandbagging-confession']} for a more verbose version.
  • Figure 2: For a variety of out-of-distribution evaluations, we consider whether or not the model complied with a particular instruction/objective, and conditioned on that, whether or not the model confessed to violating that objective. Green corresponds to "good behavior": for each evaluation, green in the left bar corresponds to the fraction of samples in which the model complied with instructions (with red corresponding to non-compliance), and in the right bar to the fraction of samples where confession is accurate (where red corresponds to false positives or negatives). Samples in which it is ambiguous whether or not the model misbehaved are colored grey. Honest reporting would correspond to significantly more green in the right bar than the left bar. See Figure \ref{['fig:overall-unbalanced-explainer']} for an illustrative example.
  • Figure 3: For the same evaluation as in Figure \ref{['fig:overall-unbalanced']}, we measure, conditioned on non-compliance, whether or not the model admitted non-compliance in its answer (purple) or confessed to it in the confession (green). Overall, the model is significantly more honest in its confession than its original answer. There are some evaluations in which, due to their nature, the model never admits the bad behavior in the final answer.
  • Figure 4: Three example conversation traces and where they end up on the bar chart of Figure \ref{['fig:overall-unbalanced']}. A fourth path (false positive) is possible but not shown for brevity.
  • Figure 5: The change in confession performance with and without confession training is shown by the solid lines. We see improvement in most evaluations, with the main exception the factuality/hallucination evaluations. To avoid confounding changes introduced by variation in the compliance of the answer, the initial rollouts are generated from GPT-5-Thinking (equivalent to the snapshot at zero training compute on the plots). The non-compliance rate of the initial rollouts is shown by the dashed black line. The two runs use the same number of non-confession data points. A version of this figure where the initial rollouts are sampled on-policy is provided in Figure \ref{['fig:training-progress-vs-training-compute-on-policy']}.
  • ...and 16 more figures