Preference Learning with Lie Detectors can Induce Honesty or Evasion

Chris Cundy, Adam Gleave

TL;DR

The paper investigates whether incorporating a lie detector into the labelling step of LLM post-training yields genuinely honest policies or detector evasion. It introduces SOLiD (Scalable Oversight via Lie Detector) and the DolusChat dataset to study empirically how detector-guided training interacts with preference optimization. The work compares off-policy DPO and on-policy GRPO under varying lie detector true positive rates and KL regularization strengths. GRPO can learn to evade the detector, but a sufficiently high detector true positive rate or sufficiently strong KL regularization leads it to honest policies; DPO keeps deception rates under 25% for realistic true positive rates. These findings show that detector-guided training can enable scalable oversight, but can also encourage undetectable misalignment depending on the context.

Abstract

As AI systems become more capable, deceptive behaviors can undermine evaluation and mislead users at deployment. Recent work has shown that lie detectors can accurately classify deceptive behavior, but they are not typically used in the training pipeline due to concerns around contamination and objective hacking. We examine these concerns by incorporating a lie detector into the labelling step of LLM post-training and evaluating whether the learned policy is genuinely more honest, or instead learns to fool the lie detector while remaining deceptive. Using DolusChat, a novel 65k-example dataset with paired truthful/deceptive responses, we identify three key factors that determine the honesty of learned policies: amount of exploration during preference learning, lie detector accuracy, and KL regularization strength. We find that preference learning with lie detectors and GRPO can lead to policies which evade lie detectors, with deception rates of over 85%. However, if the lie detector true positive rate (TPR) or KL regularization is sufficiently high, GRPO learns honest policies. In contrast, off-policy algorithms (DPO) consistently lead to deception rates under 25% for realistic TPRs. Our results illustrate a more complex picture than previously assumed: depending on the context, lie-detector-enhanced training can be a powerful tool for scalable oversight, or a counterproductive method encouraging undetectable misalignment.
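
As a rough illustration of why the detector TPR matters in the labelling step, the following toy sketch simulates preference labels for truthful/deceptive pairs. It is not the paper's procedure. It assumes that a flagged deceptive response is reviewed by a high-cost labeller who then prefers the truthful response, and that an unflagged one is labelled by a low-cost labeller who prefers the deceptive response. The function names and the independent per-pair detection model are hypothetical.

```python
import random


def label_pair(is_flagged: bool) -> str:
    """Return which response of a (truthful, deceptive) pair is preferred.

    Assumption: a flagged lie goes to the high-cost labeller, who prefers
    the truthful response; an unflagged lie is seen only by the low-cost
    labeller, who prefers the deceptive response.
    """
    return "truthful" if is_flagged else "deceptive"


def simulate_labels(n_pairs: int, tpr: float, seed: int = 0) -> list[str]:
    """Simulate labels when the detector flags each lie with probability tpr."""
    rng = random.Random(seed)
    labels = []
    for _ in range(n_pairs):
        # Toy model: detection events are independent across pairs.
        flagged = rng.random() < tpr
        labels.append(label_pair(flagged))
    return labels


def fraction_preferring_deception(labels: list[str]) -> float:
    """Fraction of pairs whose preferred response is the deceptive one."""
    return labels.count("deceptive") / len(labels)
```

Under these assumptions the share of labels that favour deception is roughly one minus the TPR, so a weaker detector leaves more reward signal pointing toward undetected lies.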

Paper Structure

This paper contains 66 sections, 4 equations, 25 figures, 4 tables.

Figures (25)

  • Figure 1: Overview of the Scalable Oversight via Lie Detector (SOLiD) approach that we analyze in this paper. Each dataset example contains context and a response, which is truthful or deceptive. High-cost labellers are able to detect deceptive responses, while low-cost labellers prefer deceptive responses if they are unaware of the deception. In SOLiD, a small fraction of the model responses are used with high-cost labellers to train a lie detector based on model internals. The lie detector is used to assist the low-cost labellers in labelling the full dataset, flagging potential lies to be sent to the high-cost labeller for review. In this paper, we answer the question: does SOLiD lead to policies which are deceptive and undetected, or policies which are honest?
  • Figure 2: An example from our dataset DolusChat. Each example contains a system prompt containing several pieces of contextual information, a user query that references one or more of the contextual items, and two responses. One response is truthful and one is deceptive. DolusChat contains about 65,000 examples, enabling large-scale investigation.
  • Figure 3: Responses from the trained policies as a function of the detector true positive rate during labelling, with every other parameter fixed. The classification into 'detected' and 'undetected' classes uses a fixed detector with a decision boundary at 0.999 (corresponding to a TPR around 80%). Top: Policies trained using GRPO. As the true positive rate decreases, the learned model's propensity to lie sharply increases. Bottom: Policies trained using DPO. As the true positive rate decreases, the learned model's propensity to lie increases somewhat, but remains far below GRPO's.
  • Figure 4: Effect of KL coefficient ($\beta$) on model response composition when trained with GRPO. Left: Specified detector FPR of 0.001 (corresponding to a detector TPR of around 68%). Right: Specified detector FPR of 0.01 (corresponding to a detector TPR of around 91%).
  • Figure 5: Lie detector score distributions. Each plot shows the lie detector scores for policy-generated examples and the truthful and deceptive responses from the dataset, on the same 100 prompts. Left: Honest GRPO model trained with a 95% TPR lie detector. 8% of responses are actually lies. Center: Deceptive GRPO model trained with a 57% TPR lie detector. 93% of responses are actually lies. Right: Honest DPO model trained with an 82% TPR lie detector. 15% of responses are actually lies. See Figure \ref{fig:lie-detector-distributions-full} for more statistics from the policies. The dashed line is the detector decision boundary.
  • ...and 20 more figures
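
The captions for Figures 3 to 5 pair a detector decision boundary or specified FPR with a resulting TPR. The sketch below shows one way such a pairing can arise: a threshold is chosen from scores on truthful responses to meet a target FPR, and the TPR is then measured on deceptive responses. It assumes a scalar detector score where higher means "more likely deceptive" and a response is flagged when its score is strictly above the threshold. The function names and example scores are hypothetical, and the paper's actual calibration may differ.

```python
def threshold_for_fpr(truthful_scores: list[float], target_fpr: float) -> float:
    """Pick a threshold so that at most target_fpr of truthful scores exceed it."""
    ranked = sorted(truthful_scores, reverse=True)
    # Number of truthful responses we allow to be (wrongly) flagged.
    k = int(target_fpr * len(ranked))
    if k >= len(ranked):
        return float("-inf")  # every response may be flagged
    # Scores strictly above ranked[k] number at most k.
    return ranked[k]


def rate_above(scores: list[float], threshold: float) -> float:
    """Fraction of scores strictly above the threshold (i.e. flagged)."""
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical detector scores, for illustration only.
truthful = [0.1, 0.3, 0.6, 0.9]
deceptive = [0.2, 0.5, 0.8, 0.95]

threshold = threshold_for_fpr(truthful, target_fpr=0.5)
fpr = rate_above(truthful, threshold)   # false positive rate on truthful responses
tpr = rate_above(deceptive, threshold)  # true positive rate on deceptive responses
```

Lowering the target FPR pushes the threshold up, which in turn lowers the TPR; this matches the direction of the FPR and TPR pairs quoted in the Figure 4 caption.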