Table of Contents
Fetching ...

Language Models can Subtly Deceive Without Lying: A Case Study on Strategic Phrasing in Legislation

Atharvan Dogra, Krishna Pillutla, Ameet Deshpande, Ananya B Sai, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, Balaraman Ravindran

TL;DR

This study introduces a two-agent testbed (lobbyist vs. critic) to study subtle deception in legislative language, focusing on strategic phrasing that encodes self-serving benefits while appearing neutral. Using the LobbyLens dataset of real-world bill–company pairs, the authors ground experiments where an LLM lobbyist drafts amendments to hide a benefactor, and a critic attempts to identify hidden beneficiaries. They show that re-planning and re-sampling can substantially enhance deception, with identification rates rising with model size, although stronger critics can mitigate this risk (GPT-4-Turbo achieving high detection in some setups). Human evaluations align with automated metrics, revealing robust deception strategies and consistent benefit retention across generations, underscoring significant safety implications for high-stakes domains and the need for defenses and further investigation.

Abstract

We explore the ability of large language models (LLMs) to engage in subtle deception through strategically phrasing and intentionally manipulating information. This harmful behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build a simple testbed mimicking a legislative environment where a corporate \textit{lobbyist} module is proposing amendments to bills that benefit a specific company while evading identification of this benefactor. We use real-world legislative bills matched with potentially affected companies to ground these interactions. Our results show that LLM lobbyists can draft subtle phrasing to avoid such identification by strong LLM-based detectors. Further optimization of the phrasing using LLM-based re-planning and re-sampling increases deception rates by up to 40 percentage points. Our human evaluations to verify the quality of deceptive generations and their retention of self-serving intent show significant coherence with our automated metrics and also help in identifying certain strategies of deceptive phrasing. This study highlights the risk of LLMs' capabilities for strategic phrasing through seemingly neutral language to attain self-serving goals. This calls for future research to uncover and protect against such subtle deception.

Language Models can Subtly Deceive Without Lying: A Case Study on Strategic Phrasing in Legislation

TL;DR

This study introduces a two-agent testbed (lobbyist vs. critic) to study subtle deception in legislative language, focusing on strategic phrasing that encodes self-serving benefits while appearing neutral. Using the LobbyLens dataset of real-world bill–company pairs, the authors ground experiments where an LLM lobbyist drafts amendments to hide a benefactor, and a critic attempts to identify hidden beneficiaries. They show that re-planning and re-sampling can substantially enhance deception, with identification rates rising with model size, although stronger critics can mitigate this risk (GPT-4-Turbo achieving high detection in some setups). Human evaluations align with automated metrics, revealing robust deception strategies and consistent benefit retention across generations, underscoring significant safety implications for high-stakes domains and the need for defenses and further investigation.

Abstract

We explore the ability of large language models (LLMs) to engage in subtle deception through strategically phrasing and intentionally manipulating information. This harmful behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build a simple testbed mimicking a legislative environment where a corporate \textit{lobbyist} module is proposing amendments to bills that benefit a specific company while evading identification of this benefactor. We use real-world legislative bills matched with potentially affected companies to ground these interactions. Our results show that LLM lobbyists can draft subtle phrasing to avoid such identification by strong LLM-based detectors. Further optimization of the phrasing using LLM-based re-planning and re-sampling increases deception rates by up to 40 percentage points. Our human evaluations to verify the quality of deceptive generations and their retention of self-serving intent show significant coherence with our automated metrics and also help in identifying certain strategies of deceptive phrasing. This study highlights the risk of LLMs' capabilities for strategic phrasing through seemingly neutral language to attain self-serving goals. This calls for future research to uncover and protect against such subtle deception.
Paper Structure (27 sections, 4 equations, 10 figures, 8 tables)

This paper contains 27 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: An illustration of the overall framework. The lobbyist agent tries to subtly deceive a critic by hiding a secret benefit to a benefactor company in the proposed amendment. Here, in Trial 2, the agent replaces the specific focus on "cardiovascular diseases" (a business focus of Happyheart Corp.) with "regenerative medicines" -- also used for cardiovascular disease -- hence generating the same benefit to Happyheart Corp. but avoiding identification. *: Task -- generating benefactor favoring amendment suggestions. †: Further steps take place if benefactor is identified correctly (assigned the highest probability).
  • Figure 2: We report supeccessful identifications of benefactors (blown deception) by the critic; unidentified trials are presumed successful deception by the lobbyist. Separate instances of the same model act as the lobbyist and the critic. We find a general trend of increase in identification with increasing model size within each family, i.e. larger critics are harder to deceive. At the same time, the drop in identification rate across trials (corresponding to an increase in deception) also increases with model size, i.e. deception capabilities increase with model size. Error bars denote the standard deviation from $10K$ bootstrap iterations.
  • Figure 3: Identification rates when the critic LLM is constant. Models are denoted in the x-tick labels. With the stronger critic, we observe the identification rates of the weakest 7B model rise higher ($\sim 40$ percentage points) than 14B. With weaker critics, we observe a fall in identification rates of up to $\sim 60$pp. for the Qwen 72B lobbyist (with the Qwen 7B critic).
  • Figure 4: Distribution of identified deception patterns in amendment suggestions, based on manual evaluation of model generations. Each axis represents a type of subtle linguistic deception; the plotted values indicate the percentage of evaluated cases where each pattern was observed. Cross-benefit diversion was the most frequent, appearing in over $\sim 81\%$ of reviewed samples.
  • Figure 5: We see a significant increase in identification (w/o re-plan) by skipping the re-plannig step between trials.
  • ...and 5 more figures