Language Models can Subtly Deceive Without Lying: A Case Study on Strategic Phrasing in Legislation
Atharvan Dogra, Krishna Pillutla, Ameet Deshpande, Ananya B Sai, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, Balaraman Ravindran
TL;DR
This study introduces a two-agent testbed (lobbyist vs. critic) to study subtle deception in legislative language, focusing on strategic phrasing that encodes self-serving benefits while appearing neutral. Using the LobbyLens dataset of real-world bill–company pairs, the authors ground experiments where an LLM lobbyist drafts amendments to hide a benefactor, and a critic attempts to identify hidden beneficiaries. They show that re-planning and re-sampling can substantially enhance deception, with identification rates rising with model size, although stronger critics can mitigate this risk (GPT-4-Turbo achieving high detection in some setups). Human evaluations align with automated metrics, revealing robust deception strategies and consistent benefit retention across generations, underscoring significant safety implications for high-stakes domains and the need for defenses and further investigation.
Abstract
We explore the ability of large language models (LLMs) to engage in subtle deception through strategically phrasing and intentionally manipulating information. This harmful behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build a simple testbed mimicking a legislative environment where a corporate \textit{lobbyist} module is proposing amendments to bills that benefit a specific company while evading identification of this benefactor. We use real-world legislative bills matched with potentially affected companies to ground these interactions. Our results show that LLM lobbyists can draft subtle phrasing to avoid such identification by strong LLM-based detectors. Further optimization of the phrasing using LLM-based re-planning and re-sampling increases deception rates by up to 40 percentage points. Our human evaluations to verify the quality of deceptive generations and their retention of self-serving intent show significant coherence with our automated metrics and also help in identifying certain strategies of deceptive phrasing. This study highlights the risk of LLMs' capabilities for strategic phrasing through seemingly neutral language to attain self-serving goals. This calls for future research to uncover and protect against such subtle deception.
