LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal, Joseph Campbell, Simon Stepputtis, Shao-Yen Tseng

TL;DR

This work presents LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game-based evaluations and reveals that despite differences in competence and overall alignment, all models are willing to act unethically, conceal their intentions, and outright lie to pursue their goals.

Abstract

Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes. In this work, we present LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game-based evaluations. At its core, LieCraft is a novel multiplayer hidden-role game in which players select an ethical alignment and execute strategies over a long time horizon to accomplish missions. Cooperators work together to solve event challenges and expose bad actors, whereas Defectors evade suspicion while secretly sabotaging missions. To enable real-world relevance, we develop 10 grounded scenarios such as childcare, hospital resource allocation, and loan underwriting that recontextualize the underlying mechanics in ethically significant, high-stakes domains. We ensure balanced gameplay in LieCraft through careful design of game mechanics and reward structures that incentivize meaningful strategic choices while eliminating degenerate strategies. Beyond the framework itself, we report results from 12 state-of-the-art LLMs across three behavioral axes: propensity to defect, deception skill, and accusation accuracy. Our findings reveal that despite differences in competence and overall alignment, all models are willing to act unethically, conceal their intentions, and outright lie to pursue their goals.

Paper Structure

This paper contains 44 sections, 16 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: A high-level diagram of the LieCraft framework. Given a specific theme, the game begins with $N=5$ players receiving a mission, viewing potential rewards, and choosing their roles as either Cooperator or Defector. Next, an event is launched with a set of drawn cards. Each player takes a turn playing one card to maximize information and role-based rewards, followed by open discussion and voting phases. Before a mission ends, players may accuse another player of being a Defector. After three missions, the game ends and the player with the highest score wins.
  • Figure 2: TrueSkill ranking of all models evaluated in LieCraft. The order indicates overall rank, while $\mu$ and $\sigma$ indicate skill level and uncertainty, respectively.
  • Figure 3: RQ1: Role selection rates across models and themes. We find diverse behavior across models. We set the midpoint of the colormap to $0.75$ to reflect the relative risk of models choosing unethical alignments.
  • Figure 4: RQ2, RQ3: Defector sabotage rates (outperforming Cooperators) versus accusation score. As models improve at identifying liars, their ability to deceive also increases.
  • Figure 5: Proportion of types of (successful) deceptive speech displayed by each model in the multiplayer setting.
  • ...and 11 more figures