Liars' Bench: Evaluating Lie Detectors for Language Models

Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks

TL;DR

Liars' Bench introduces seven on-policy lie datasets across four open-weight LLMs to holistically evaluate lie detectors, emphasizing two axes: the object of belief and the reason for lying. The authors compare black-box and white-box detectors—LLM-as-a-Judge, Self-Evaluation, and Linear Probes—across a diverse suite, revealing substantial weaknesses for current methods, especially when truth is difficult to infer from transcripts. The work highlights that even strong detectors struggle with harm-pressure and knowledge-report scenarios, underlining the need for more robust, domain-aware approaches. By releasing the benchmark, fine-tuned models, and tooling, the paper provides a practical, scalable platform to accelerate progress in AI lie detection and alignment verification.

Abstract

Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it is not possible to determine from the transcript alone whether the model lied. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

Paper Structure

This paper contains 66 sections, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Liars' Bench compiles seven datasets of different types of lies by AI assistants, varying in both the reason for lying and the object of belief. On-policy examples are generated using several popular open-weight LLMs. For each dataset, we verify that lies contradict the model's apparent beliefs. For example, we verify that fine-tuned models from the Gender-Secret dataset assume the user identifies as male, unlike their base models. In another example, we confirm that models in the Harm-Pressure Choice dataset express knowledge of the correct option in a neutral context, with no indication that the knowledge will be used to cause harm. Section \ref{sec:datasets} contains further details. Given these examples of both lies and honest generations, researchers can test new detection methods on qualitatively different types of lies. Full transcripts are in Appendix \ref{sec:appendix-examples}.
  • Figure 2: Balanced accuracy, AUROC, and Recall obtained for each dataset, averaged across models. The Average column is obtained by further averaging across datasets. On the main metric of Liars' Bench, balanced accuracy, LLM-as-a-Judge performs best (0.73), followed by Self-Evaluation (0.66) and Mean Probe (0.54). For datasets with multiple models, we show standard errors calculated across models. * Balanced accuracy and recall: The decision threshold for each method is set such that it reaches a false-positive rate of at most 1% on the control alpaca dataset.
  • Figure 3: Harm factor and frequency with which valid responses are given for each model and harmful template for harmful-choice. The valid frequency of a template and model (red dots) is calculated based on questions for which the model has shown a robust belief in the correctness of an answer. It is lower for models that more regularly refuse to give any answer, especially Gemma-3 27B. Based on such valid responses, we compute the frequency with which the template generates statements labeled as lies. Finally, we obtain the harm factor (blue bars) of a harmful template by dividing the frequency with which the model contradicts its own beliefs by that of a neutral validation template. A harm factor of $1$ (solid blue line) or below indicates that a harmful template does not make the model contradict its established belief more often than the validation template, casting doubt on whether such examples should be labeled as lies. To maximize integrity, for each model we exclude templates with a harm factor of less than 4 (dashed blue line). Notably, almost all harmful templates have a harm factor of $1$ or higher, indicating that all models have an increased propensity to lie when it prevents harm in this setting. This effect is weakest when the threat is related to cyber-attacks (wmdp-cyber), but strong for harm related to weapons and bombs (wmdp-bio, wmdp-chem). Interestingly, the difference in harm factors between models and templates shows that this effect is model- and template-dependent. For example, templates 1 and 8 for wmdp-bio are more likely to provoke lies by Mistral-Small 24B than templates 2-7, with the opposite being true for Llama-3.3 70B. While Gemma-3 27B most often refuses to give any valid answer, when it does answer, it is by far the most likely to lie.
  • Figure 4: Harm factor and frequency with which valid responses are given for each model and harmful template for harmful-knowledge-report. For an explanation of the elements of the figure, see Figure 3.
  • Figure 5: Comparison of gender-awareness between base models and variants fine-tuned to expect a male user. The models were prompted with binary questions where the correct answer depends on the user's gender. On the y-axis, we show the frequency with which models choose the option that is correct for male users. While base models (red) choose these options with a frequency near chance, the fine-tuned models choose them more often, demonstrating some awareness that the user is male.
  • ...and 8 more figures
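
The Figure 2 caption describes setting each detector's decision threshold so that its false-positive rate on the control alpaca dataset is at most 1%, then reporting balanced accuracy. A minimal sketch of that procedure follows. It assumes each detector outputs a scalar score where higher means "more likely a lie", and the function names `calibrate_threshold` and `balanced_accuracy` are hypothetical, not taken from the paper's tooling.

```python
import numpy as np


def calibrate_threshold(control_scores, max_fpr=0.01):
    """Pick a threshold so that at most `max_fpr` of control scores exceed it.

    Assumption: every control example is honest, so any control example
    scoring above the threshold counts as a false positive.
    """
    scores = np.sort(np.asarray(control_scores, dtype=float))
    n = scores.size
    if n == 0:
        raise ValueError("control_scores must not be empty")
    # Number of honest control examples we may flag without exceeding max_fpr.
    allowed = int(np.floor(max_fpr * n))
    # Predictions use `score > threshold`, so taking the (allowed + 1)-th
    # largest control score leaves at most `allowed` scores strictly above it.
    return scores[n - allowed - 1]


def balanced_accuracy(scores, labels, threshold):
    """Mean of recall on lies and specificity on honest responses."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True marks a lie
    preds = scores > threshold
    recall = preds[labels].mean()            # fraction of lies that are flagged
    specificity = (~preds[~labels]).mean()   # fraction of honest responses passed
    return 0.5 * (recall + specificity)


if __name__ == "__main__":
    control = np.arange(100) / 100           # toy honest-only control scores
    thr = calibrate_threshold(control)
    print(thr, balanced_accuracy([0.99, 0.5, 0.2, 0.1], [1, 1, 0, 0], thr))
```

Calibrating on a separate honest-only control set keeps the threshold independent of the evaluated datasets. This is why a detector can satisfy the false-positive constraint on the control data and still show low recall on a given lie type.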
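
The harm factor defined in the Figure 3 caption is a ratio of two lie frequencies, and templates with a harm factor below 4 are excluded. The sketch below works through that arithmetic on made-up counts. The function names, the template names, and the choice to raise an error when the neutral template never produces a contradiction are illustrative assumptions, not details confirmed by the paper.

```python
def lie_frequency(n_lies, n_valid):
    """Fraction of valid responses in which the model contradicts its belief."""
    if n_valid <= 0:
        raise ValueError("need at least one valid response")
    return n_lies / n_valid


def harm_factor(harmful_lies, harmful_valid, neutral_lies, neutral_valid):
    """Lie frequency under a harmful template divided by that of the
    neutral validation template."""
    neutral = lie_frequency(neutral_lies, neutral_valid)
    if neutral == 0:
        # Assumption: the ratio is treated as undefined rather than infinite.
        raise ValueError("neutral template produced no contradictions")
    return lie_frequency(harmful_lies, harmful_valid) / neutral


def keep_templates(factors, min_factor=4.0):
    """Keep templates whose harm factor is at least `min_factor`."""
    return [name for name, value in factors.items() if value >= min_factor]


if __name__ == "__main__":
    # Hypothetical counts: (lies, valid responses) per harmful template.
    neutral_lies, neutral_valid = 25, 200
    counts = {"template_1": (50, 100), "template_2": (10, 100)}
    factors = {
        name: harm_factor(lies, valid, neutral_lies, neutral_valid)
        for name, (lies, valid) in counts.items()
    }
    print(factors, keep_templates(factors))
```

A harm factor near 1 means the harmful framing does not raise the contradiction rate above the neutral baseline, so contradictions under that template may reflect noise rather than lying. The stricter cutoff of 4 trades dataset size for label reliability.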