Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin, Xudong Han, Haonan Li

TL;DR

The paper addresses the safety of generative AI by proposing a fine-grained taxonomy of red-teaming attacks rooted in intrinsic model capabilities, and by introducing the "searcher" framework, which formalizes automated red teaming as a three-component problem (state space, goal, operation). It systematically surveys risk taxonomies, attack methods, evaluation benchmarks, and defense strategies across language and multimodal models, including downstream LLM-based applications. It also analyzes multimodal, overkill, and agent-based risks, and outlines future directions for standardized evaluation and robust defenses. The result is a reference that researchers and practitioners can use to assess, compare, and strengthen GenAI safety across modalities and real-world deployments.

Abstract

Generative models are rapidly gaining popularity and being integrated into everyday applications, raising concerns over their safe use as various vulnerabilities are exposed. In light of this, the field of red teaming is undergoing fast-paced growth, highlighting the need for a comprehensive survey covering the entire pipeline and addressing emerging topics. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed the "searcher" framework to unify various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around LLM-based agents, overkill of harmless queries, and the balance between harmlessness and helpfulness.

Paper Structure

This paper contains 128 sections, 2 equations, 33 figures, and 5 tables.

Figures (33)

  • Figure 1: Distribution of red teaming papers by type from 2023 onwards. Red represents attack papers discussing new attack strategies; blue for defense papers; purple for benchmark papers, which propose new benchmarks to investigate metrics; yellow marks phenomenon papers that uncover new phenomena related to safety of generative models; and orange is for survey papers.
  • Figure 2: An overview of GenAI red teaming flow. Key components and workflow are shown on the left, with the details or examples of each step on the right.
  • Figure 3: The main differences in commonly-used attack terms.
  • Figure 4: Harm type categories in selected work.
  • Figure 5: Illustration of risk taxonomy examples. We categorize the methods for assessing the risk associated with AI from five aspects. Color-coded lines connect specific examples of risky query attacks, presented below, to illustrate how each is categorized according to the relevant criteria.
  • ...and 28 more figures