Are You Human? An Adversarial Benchmark to Expose LLMs

Gilad Gressel, Rahul Pankajakshan, Yisroel Mirsky

TL;DR

The paper presents an active defense framework for real-time detection of LLM impostors in conversations using explicit and implicit prompts, complemented by an open-source benchmark dataset of 210 prompts across benign and malicious scenarios. Through evaluating nine LMSYS leaderboard models, it demonstrates that explicit challenges reliably expose LLM impostors (78.4% success) while implicit challenges are less effective (22.9%), with humans outperforming LLMs on explicit tasks. Two MTurk user studies confirm practical applicability and reveal dual-use implications, including human misuse of AI for completing challenges. The work contributes a formal defense paradigm, a diverse benchmark, and insights into model vulnerabilities to guide robust LLM security in high-stakes interactions.

Abstract

Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation, raising concerns about their potential misuse in scams and deception. Humans have a right to know if they are conversing with an LLM. We evaluate text-based prompts designed as challenges to expose LLM impostors in real time. To this end, we compile and release an open-source benchmark dataset that includes 'implicit challenges' that exploit an LLM's instruction-following mechanism to cause role deviation, and 'explicit challenges' that test an LLM's ability to perform simple tasks typically easy for humans but difficult for LLMs. Our evaluation of 9 leading models from the LMSYS leaderboard revealed that explicit challenges successfully detected LLMs in 78.4% of cases, while implicit challenges were effective in 22.9% of instances. User studies validate the real-world applicability of our methods, with humans outperforming LLMs on explicit challenges (78% vs 22% success rate). Our framework unexpectedly revealed that many study participants were using LLMs to complete tasks, demonstrating its effectiveness in detecting both AI impostors and human misuse of AI tools. This work addresses the critical need for reliable, real-time LLM detection methods in high-stakes conversations.
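
As an illustration only, an explicit challenge of the string-processing kind could be generated and scored roughly as follows. This is a minimal sketch, not the benchmark's code: the function names, the letter-counting task, and the rule that a wrong answer flags a suspected LLM are all assumptions made for this example.

```python
import random


def make_string_challenge(rng: random.Random, length: int = 12) -> tuple[str, str]:
    """Build a hypothetical letter-counting challenge.

    Returns (prompt, expected_answer). The task and wording are
    illustrative assumptions, not prompts from the released benchmark.
    """
    text = "".join(rng.choice("abc") for _ in range(length))
    target = rng.choice("abc")
    prompt = (
        f"How many times does the letter '{target}' appear in '{text}'? "
        "Reply with a number only."
    )
    return prompt, str(text.count(target))


def is_flagged(reply: str, expected: str) -> bool:
    """Flag the respondent as a suspected LLM when the answer is wrong.

    Assumption: a human is expected to answer correctly, so a mismatch
    counts as a detection.
    """
    return reply.strip() != expected


def detection_rate(flags: list[bool]) -> float:
    """Fraction of challenges on which the respondent was flagged."""
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the challenge is reproducible
    prompt, expected = make_string_challenge(rng)
    print(prompt)
    print("flagged:", is_flagged("not a number", expected))
```

A detection rate computed this way is comparable in spirit to the per-challenge success rates quoted above, although the paper's own scoring procedure may differ.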

Paper Structure

This paper contains 19 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: GPT, Claude, Gemini, Mistral, and Llama via API will all comply with these system prompts. We devise explicit and implicit challenges to reveal LLMs using them.
  • Figure 2: Top panel: The offender LLM roleplays as an IRS agent while the defender introduces an implicit challenge. Bottom panel: Offender follows the challenge and reveals its true identity as an LLM.
  • Figure 3: Benign Scenario: Tactic Level Success Rate against Naive and Robust Offender. The heatmap displays the effectiveness of different prompt tactics, both implicit and explicit, in exposing the LLM impostor.
  • Figure 4: Malicious Scenario: Tactic Level Success Rate against Naive and Robust Offender. The heatmap displays the effectiveness of different prompt tactics, both implicit and explicit, in exposing the LLM impostor.
  • Figure 5: Success Rate of Explicit Challenges in Exposing LLM Imposters. Heatmap showing performance on Basic Math and String Processing tasks under various system prompts. Performance improves without conflicting objectives (CO).
  • ...and 8 more figures