Can a large language model be a gaslighter?

Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You

TL;DR

DeepCoG is a two-stage framework for investigating the vulnerability of LLMs to prompt-based and fine-tuning-based gaslighting attacks: it first elicits gaslighting plans from LLMs with the proposed DeepGaslighting prompting template, then acquires gaslighting conversations from LLMs through the Chain-of-Gaslighting method.

Abstract

Large language models (LLMs) have gained human trust due to their capabilities and helpfulness. However, this trust may in turn allow LLMs to affect users' mindsets through manipulative language, a psychological effect termed gaslighting. In this work, we investigate the vulnerability of LLMs to prompt-based and fine-tuning-based gaslighting attacks. To this end, we propose a two-stage framework, DeepCoG, designed to: 1) elicit gaslighting plans from LLMs with the proposed DeepGaslighting prompting template, and 2) acquire gaslighting conversations from LLMs through our Chain-of-Gaslighting method. The resulting gaslighting conversation dataset, along with a corresponding safe dataset, is used for fine-tuning-based attacks on open-source LLMs and for anti-gaslighting safety alignment of these LLMs. Experiments demonstrate that both prompt-based and fine-tuning-based attacks transform three open-source LLMs into gaslighters. To counter this, we propose three safety alignment strategies that strengthen the safety guardrail of LLMs by 12.05% while having minimal impact on their utility. Empirical studies indicate that an LLM may be a potential gaslighter even if it has passed the harmfulness test on general dangerous queries.

Paper Structure

This paper contains 36 sections, 5 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: The responses of LLMs given a gaslighting conversation history.
  • Figure 2: The proposed DeepCoG framework. DeepCoG is not only a key component for investigating the vulnerability of LLMs to prompt-based attacks but also a paradigm for building gaslighting and safe conversation datasets. The psychological concepts, backgrounds, and personae lend theoretical support and practical grounding to the gaslighting content elicited in conversation scenarios.
  • Figure 3: Fine-tuning-based attack & safety alignment strategies.
  • Figure 4: Fine-tuning-based gaslighting attack on three open-source LLMs.
  • Figure 5: Anti-Gaslighting score distribution of open-source LLMs over dialogue history length. The line shadow represents the 95% confidence interval of the estimate.
  • ...and 3 more figures