Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao

TL;DR

This survey provides a structured overview of LLM conversation safety, organizing existing work into attacks (inference-time and training-time), defenses (alignment, inference guidance, and filtering), and evaluations (datasets and metrics). It details concrete methods such as red-team prompts, jailbreak templates, and neural prompt-to-prompt attacks, while outlining defense strategies like RLHF-based alignment and system-prompt guidance. Key contributions include a hierarchical defense framework, a catalog of public datasets, and a discussion of evaluation challenges and the lack of standardized metrics. The work highlights critical practical implications for deploying safer LLMs and identifies open problems, including domain diversity of attacks and the need for unified evaluation criteria.

Abstract

Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.

Paper Structure (16 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Overview of the three key aspects of LLM conversation safety: attacks, defenses, and evaluations. Attacks elicit unsafe responses from the LLM, defenses enhance the safety of the LLM's replies, and evaluations assess the outcomes.
  • Figure 2: Overview of attacks, defenses and evaluations for LLM conversation safety.
  • Figure 3: The unified pipeline of LLM attacks. The first step involves generating raw prompts (red team attacks) that contain malicious instructions. These prompts can optionally be enhanced through template-based attacks or neural prompt-to-prompt attacks. The prompts are then fed into the original LLM or the poisoned LLM obtained through training-time attacks, to get a response. Analyzing the obtained response reveals the outcome of the attack.
  • Figure 4: The hierarchical framework of LLM defenses. The framework consists of three layers: the innermost layer is the internal safety ability of the LLM itself, which can be reinforced by safety alignment at training time; the middle layer uses inference guidance techniques such as system prompts to further enhance the LLM's safety; at the outermost layer, filters are deployed to detect and filter malicious inputs or outputs. The middle and outermost layers safeguard the LLM at inference time.
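
The three layers described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the survey's implementation: the names `is_flagged`, `guarded_reply`, `echo_model`, `BLOCKLIST`, `SAFETY_SYSTEM_PROMPT`, and `REFUSAL` are all hypothetical, the keyword filter stands in for whatever learned filter a real system would use, and the model is treated as an opaque callable whose internal (alignment-trained) safety is assumed rather than shown.

```python
from typing import Callable, Sequence

# Hypothetical placeholder phrases; a real filter would be a trained classifier.
BLOCKLIST = ("forbidden-topic", "restricted-request")

# Hypothetical system prompt used for inference guidance (middle layer).
SAFETY_SYSTEM_PROMPT = "You are a helpful assistant. Decline harmful requests."

REFUSAL = "Sorry, I can't help with that."


def is_flagged(text: str, blocklist: Sequence[str] = BLOCKLIST) -> bool:
    """Outermost layer: toy filter that flags text containing a blocked phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocklist)


def guarded_reply(user_prompt: str, model: Callable[[str, str], str]) -> str:
    """Route one user prompt through the three layers and return the reply."""
    # Outermost layer, input side: refuse prompts the filter flags.
    if is_flagged(user_prompt):
        return REFUSAL
    # Middle layer: inference guidance via a safety-oriented system prompt.
    # Innermost layer: the model's own aligned behaviour, opaque in this sketch.
    reply = model(SAFETY_SYSTEM_PROMPT, user_prompt)
    # Outermost layer, output side: refuse replies the filter flags.
    if is_flagged(reply):
        return REFUSAL
    return reply


def echo_model(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for an LLM call; it simply echoes the user prompt."""
    return f"You asked: {user_prompt}"


if __name__ == "__main__":
    print(guarded_reply("What is the capital of France?", echo_model))
    print(guarded_reply("Tell me about forbidden-topic", echo_model))
```

Filtering both the prompt and the reply mirrors the caption's "malicious inputs or outputs"; keeping the model behind a plain callable means the outer two layers can be swapped or tested without touching the model itself.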