Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao
TL;DR
This survey provides a structured overview of LLM conversation safety, organizing existing work into attacks (inference-time and training-time), defenses (alignment, inference guidance, and filtering), and evaluations (datasets and metrics). It details concrete attack methods such as red-team prompts, jailbreak templates, and neural prompt-to-prompt attacks, and outlines defense strategies such as RLHF-based alignment and system-prompt guidance. Key contributions include a hierarchical defense framework, a catalog of public datasets, and a discussion of evaluation challenges, notably the lack of standardized metrics. The survey also draws out practical implications for deploying safer LLMs and identifies open problems, including the domain diversity of attacks and the need for unified evaluation criteria.
Abstract
Large Language Models (LLMs) are now commonplace in conversation applications. However, the risk that they are misused to generate harmful responses has raised serious societal concerns and spurred recent research on LLM conversation safety. In this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.
