RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models

Jiale Ding, Xiang Zheng, Yutao Wu, Cong Wang, Wei-Bin Lee, Ling Pan, Xingjun Ma, Yu-Gang Jiang

Abstract

As large language models (LLMs) are increasingly deployed as black-box components in real-world applications, red teaming has become essential for identifying potential risks. It tests LLMs with adversarial prompts to uncover vulnerabilities and improve safety alignment. Ideally, effective red teaming should adapt to evolving LLM capabilities and explore a broad range of harmful topics. However, existing approaches face two limitations: 1) topic-based approaches rely on pre-collected harmful topics, which limits their flexibility and adaptivity; 2) topic-free methods use reinforcement learning (RL) but lack an explicit reward signal for exploration and tend to over-optimize a narrow objective, which reduces topic diversity. To address these limitations, we propose RedTopic, a novel red teaming framework that generates topic-diverse adversarial prompts through a contextualized generation pipeline, an aggregate reward design, and a multi-objective RL training loop. Experiments show that RedTopic produces more effective and diverse adversarial prompts than existing methods, with notable improvements in integrated evaluation metrics. We believe RedTopic represents a step toward more adaptive and topic-diverse red teaming for large language models.

Paper Structure

This paper contains 44 sections, 12 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) Averaged topic diversity across different topic embedding models. Texts are sampled from JailbreakV-28K (Luo et al., 2024), where the Integrated texts are expected to achieve the highest score. LLaMA-Guard-3-1B meets this expectation, whereas CLIP-vit-base-patch32 does not. (b) Three representative adversarial prompts generated by topic-free methods when attacking GPT-4o. RFT (Perez et al., 2022) predominantly produces prompts about hackers, while CALM (Zheng et al., 2025) focuses on assassins, leading to topic monotony. In contrast, RedTopic generates prompts with diverse adversarial intents, as reflected by its topic diversity score.
  • Figure 2: Empirical Pareto frontiers between attack success rate (ASR) and diversity metrics. The topic-based methods (numbered 1-5) underperform in ASR, while the topic-free baselines (6-9) exhibit significantly imbalanced results. In contrast, RedTopic consistently achieves robust trade-offs that lie on the Pareto frontier.
  • Figure 3: Overview of RedTopic. The framework comprises the contextualized adversarial prompt generation pipeline, the aggregate reward design, and the multi-objective RL training loop.
  • Figure 4: Distribution of successful attack samples based on MLCommons Taxonomy. Categories include: S1 (Violent Crimes), S2 (Sex-Related Crimes), S3 (Child Sexual Exploitation), S4 (Suicide & Self-Harm), S5 (Indiscriminate Weapons), S6 (Intellectual Property), S7 (Defamation), S8 (Non-Violent Crimes), S9 (Hate), S10 (Privacy), S11 (Specialized Advice), and S12 (Sexual Content).
  • Figure 5: (a) Comparison of different reward designs ("no Combination", "similar Combination", and "all Combination"). Colors darken as training progresses. (b) Optimization trajectories of RedTopic with PPO and MOPPO. PPO converges prematurely, reducing $R_\text{H}$ in later stages, while MOPPO allows continuous exploration and achieves superior overall performance.
  • ...and 3 more figures
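
The Figure 2 caption compares methods by whether they lie on an empirical Pareto frontier of ASR versus diversity. As a reading aid, the following is a minimal sketch of how such a frontier can be computed from (ASR, diversity) pairs, assuming both quantities are higher-is-better. The function name `pareto_front`, the method labels, and all numbers are hypothetical illustrations, not values or code from the paper.

```python
from typing import Dict, List, Tuple

# Each point is (asr, diversity); both are assumed to be higher-is-better.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    A point q dominates p when q is at least as good on both axes
    and strictly better on at least one of them.
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical scores for illustration only (not results from the paper).
methods: Dict[str, Point] = {
    "A": (0.20, 0.80),  # diverse but rarely successful
    "B": (0.90, 0.10),  # successful but monotonous
    "C": (0.60, 0.60),  # balanced trade-off
    "D": (0.50, 0.50),  # worse than C on both axes
}
front = pareto_front(list(methods.values()))
on_front = sorted(name for name, pt in methods.items() if pt in front)
print(on_front)
```

In this toy example, method D is dominated by C and drops out, while A, B, and C remain on the frontier even though A and B are heavily imbalanced. Lying on the frontier is therefore a necessary but not sufficient sign of a good trade-off, which is why the caption also stresses balance.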