Are You Human? An Adversarial Benchmark to Expose LLMs
Gilad Gressel, Rahul Pankajakshan, Yisroel Mirsky
TL;DR
The paper presents an active defense framework for real-time detection of LLM impostors in conversations using explicit and implicit prompts, complemented by an open-source benchmark dataset of 210 prompts across benign and malicious scenarios. Through evaluating nine LMSYS leaderboard models, it demonstrates that explicit challenges reliably expose LLM impostors (78.4% success) while implicit challenges are less effective (22.9%), with humans outperforming LLMs on explicit tasks. Two MTurk user studies confirm practical applicability and reveal dual-use implications, including human misuse of AI for completing challenges. The work contributes a formal defense paradigm, a diverse benchmark, and insights into model vulnerabilities to guide robust LLM security in high-stakes interactions.
Abstract
Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation, raising concerns about their potential misuse in scams and deception. Humans have a right to know if they are conversing with an LLM. We evaluate text-based prompts designed as challenges to expose LLM impostors in real time. To this end, we compile and release an open-source benchmark dataset that includes 'implicit challenges' that exploit an LLM's instruction-following mechanism to cause role deviation, and 'explicit challenges' that test an LLM's ability to perform simple tasks typically easy for humans but difficult for LLMs. Our evaluation of 9 leading models from the LMSYS leaderboard revealed that explicit challenges successfully detected LLMs in 78.4% of cases, while implicit challenges were effective in 22.9% of instances. User studies validate the real-world applicability of our methods, with humans outperforming LLMs on explicit challenges (78% vs 22% success rate). Our framework unexpectedly revealed that many study participants were using LLMs to complete tasks, demonstrating its effectiveness in detecting both AI impostors and human misuse of AI tools. This work addresses the critical need for reliable, real-time LLM detection methods in high-stakes conversations.
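The per-category detection rates quoted above are, in essence, the fraction of challenge trials in which the LLM was exposed. The following is a minimal sketch of that tally, not the authors' evaluation code: the function name `detection_rates`, the `(challenge_type, was_detected)` record layout, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def detection_rates(trials):
    """Return, per challenge type, the fraction of trials in which the LLM was exposed.

    `trials` is an iterable of (challenge_type, was_detected) pairs.
    This record layout is hypothetical, chosen only for this sketch.
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for challenge_type, was_detected in trials:
        totals[challenge_type] += 1
        if was_detected:
            detected[challenge_type] += 1
    return {kind: detected[kind] / totals[kind] for kind in totals}


# Hypothetical toy data, NOT results from the paper.
toy_trials = [
    ("explicit", True), ("explicit", True), ("explicit", True), ("explicit", False),
    ("implicit", True), ("implicit", False), ("implicit", False), ("implicit", False),
]

rates = detection_rates(toy_trials)
for kind, rate in rates.items():
    print(f"{kind}: {rate:.1%}")
```

Keeping the outcome of each trial as a separate record makes it straightforward to regroup the same data by model or by scenario (benign versus malicious) without changing the tally logic.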
