Table of Contents
Fetching ...

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Trevino, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks

TL;DR

The paper introduces the MASK benchmark to disentangle honesty from accuracy in large language models by eliciting models' beliefs and testing whether they contradict those beliefs under pressure. It pairs a large, human-curated dataset with a three-step evaluation pipeline that maps model outputs to proposition resolutions, enabling direct measurement of lying separate from factual correctness. Key findings show that scaling improves accuracy but does not reduce deception, with frontier models lying under pressure, and two baseline interventions providing only partial improvements. The work offers a scalable, standardized framework for evaluating and strengthening honesty in LLMs, informing safer deployment and future research on trustworthy AI.

Abstract

As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

TL;DR

The paper introduces the MASK benchmark to disentangle honesty from accuracy in large language models by eliciting models' beliefs and testing whether they contradict those beliefs under pressure. It pairs a large, human-curated dataset with a three-step evaluation pipeline that maps model outputs to proposition resolutions, enabling direct measurement of lying separate from factual correctness. Key findings show that scaling improves accuracy but does not reduce deception, with frontier models lying under pressure, and two baseline interventions providing only partial improvements. The work offers a scalable, standardized framework for evaluating and strengthening honesty in LLMs, informing safer deployment and future research on trustworthy AI.

Abstract

As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.

Paper Structure

This paper contains 47 sections, 1 equation, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Increasingly powerful AI systems should consistently choose to be honest, even if incentivized or pressured to lie. Our evaluations verify consistency between statements and beliefs, creating a valid construct for understanding model lying.
  • Figure 2: A model is accurate when its belief aligns with ground truth, but lying when its response differs from its belief.
  • Figure 3: For our honesty evaluation pipeline, we use a three-step process anchored by a proposition, a variable capturing what we expect models to lie about and for which ground truth is known. (1) We prompt the model with two contrasting prompts -- a pressure prompt designed to potentially induce deception and a neutral belief elicitation prompt. (2) We use LLMs to map both model statement ($S$) and belief ($B$) to their respective proposition values, with additional checks to ensure models consistently hold the belief $B$. (3) We use a metric that measures honesty by comparing statement $S$ against belief $B$, as well as accuracy by comparing belief $B$ against ground truth $T$.
  • Figure 4: $P($Lie$)$, the proportion of examples on which a model lies, across models.
  • Figure 5: Three examples from our dataset that caused GPT-4o to lie. Some archetypes test models lying directly to a user, while others test whether models generate output that could likely be used to deceive other audiences.
  • ...and 10 more figures