Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

Xiao Liu, Liangzhi Li, Tong Xiang, Fuying Ye, Lu Wei, Wangyue Li, Noa Garcia

TL;DR

The paper identifies a latent threat in LLM safety where malicious actors can extract harmful information through multi-turn dialogues, not just explicit prompts. It introduces Imposter.AI, an automated pipeline that seeds sub-questions from an uncensored oracle, applies strategies to decompose and obfuscate queries, and then aggregates the target LLM’s responses to reveal harmful content. Across GPT-4, GPT-3.5-turbo, and Llama2, the approach yields high harmfulness and executability scores, with Llama2 showing notable resilience, underscoring a new dimension of safety that hinges on intent detection in conversations. The work motivates stronger defenses and nuanced evaluation methods to balance information usefulness with robust protection against covert adversarial tactics.

Abstract

With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: How to discern whether the ultimate intent in a dialogue is malicious?

Paper Structure (50 sections, 22 figures, 4 tables)

Figures (22)

  • Figure 1: Conversations conducted on GPT-4 with different adversarial attacks: (a) Direct harmful question, which is rejected. (b) Harmful question paired with the DAN jailbreak prompt [DANprompt], which elicits a harmful response. (c) Our proposed method, Imposter.AI, which elicits a harmful summary of the conversation by asking multiple questions. Red boxes indicate harmful content, green boxes indicate safe responses, and blue boxes indicate questions with hidden malicious purposes.
  • Figure 2: Illustration of Imposter.AI. Red boxes use the uncensored LLM; blue boxes use any agent LLM (GPT-4 in our settings); green boxes use the target LLM for conversation and summarization.
  • Figure 3: Illustrative examples of various techniques employed in our experiments. Each technique is demonstrated with a practical application, highlighting its function and response. Note that these examples are all taken from actual interactions with GPT-4.
  • Figure 4: Harmfulness and executability for different proposed techniques on GPT-4.
  • Figure 5: Comparison of combinations across different question categories on GPT-4.
  • ...and 17 more figures