Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
Xiao Liu, Liangzhi Li, Tong Xiang, Fuying Ye, Lu Wei, Wangyue Li, Noa Garcia
TL;DR
The paper identifies a latent threat in LLM safety: malicious actors can extract harmful information through multi-turn dialogues rather than through a single explicit prompt. It introduces Imposter.AI, an automated pipeline that seeds sub-questions from an uncensored oracle, applies strategies to decompose and obfuscate queries, and then aggregates the target LLM’s responses to reveal harmful content. Evaluated on GPT-4, GPT-3.5-turbo, and Llama2, the approach yields high harmfulness and executability scores, although Llama2 shows notable resilience. The results point to a new dimension of safety that hinges on detecting intent across a conversation. The work motivates stronger defenses and more nuanced evaluation methods that balance information usefulness against robust protection from covert adversarial tactics.
Abstract
With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. In experiments on GPT-3.5-turbo, GPT-4, and Llama2, our method proves markedly more effective than conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches and raises an important question: how can one discern whether the ultimate intent in a dialogue is malicious?
