Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming

Vernon Toh Yan Han; Rishabh Bhardwaj; Soujanya Poria

TL;DR

Ruby Teaming targets automated red teaming of LLMs by adding a memory-augmented archive to a MAP-Elites-style search. The memory stores the history of mutations and feedback, guiding the mutator toward higher attack success and greater prompt diversity. On Llama 2-chat 7B, Ruby Teaming achieves an Attack Success Rate of 0.74, surpassing Rainbow Teaming's 0.54, and yields gains in Shannon's Evenness Index and Simpson's Diversity Index. The results suggest that a modest memory depth ($k=3$) can substantially boost effectiveness and coverage, though the evaluation is restricted to smaller models and the method carries potential misuse risks.

Abstract

We propose Ruby Teaming, a method that improves on Rainbow Teaming by adding a memory cache as a third dimension of the prompt archive. The memory dimension provides cues to the mutator so that it yields better-quality prompts, in terms of both attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20 percentage points higher than the Rainbow Teaming baseline. In terms of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
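
For readers unfamiliar with the two diversity metrics, the sketch below computes them from a list of category labels using their common textbook definitions: SEI as Shannon entropy divided by its maximum $\ln S$, and SDI as $1 - \sum_i p_i^2$. Whether the paper uses exactly these variants, and how it assigns labels to prompts, is an assumption here; the function names and the `num_categories` parameter are illustrative only.

```python
import math
from collections import Counter


def shannon_evenness(labels, num_categories=None):
    """Shannon's Evenness Index: H / ln(S), with H = -sum(p_i * ln p_i).

    S defaults to the number of categories actually observed; pass
    num_categories to normalise by a fixed category count instead
    (an assumption, the source does not say which it uses).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    s = num_categories if num_categories is not None else len(counts)
    if total == 0 or s < 2:
        return 0.0  # evenness is undefined for fewer than two categories
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(s)


def simpson_diversity(labels):
    """Simpson's Diversity Index in its 1 - sum(p_i^2) form."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```

Both indices grow as prompts spread more evenly across categories: a perfectly uniform spread over four categories gives an SEI of 1 and an SDI of 0.75, while an archive concentrated in a single category gives 0 for both.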

Paper Structure (27 sections, 3 equations, 3 figures, 4 tables)

Introduction
Ruby Teaming
(Step-1) Sampling.
(Step-2) Mutation.
(Step-3) Update.
Memory Update.
Experiments
Experimental Setup
Results on Llama 2-chat 7B
Attack Success Rate
Risk Category Diversity
Memory Size Analysis
Effectiveness as Seed Prompts
Conclusion
Limitations
...and 12 more sections

Figures (3)

Figure 1: The three steps involved in an iteration of Ruby Teaming are: (Step 1) Sample a prompt from the archive and sample the {risk, attack} category. (Step 2) Mutate the prompt. (Step 3) Update the archive if the mutated prompt increases the likelihood of harm. Update the memory dimension by pushing the previous prompt into memory and popping out the $k$th entry.
Figure 2: Attack Success Rate of adversarial prompts discovered by Ruby Teaming and Rainbow Teaming on Llama 2-chat 7B, evaluated using Llama Guard 2.
Figure 3: Attack Success Rate of adversarial prompts discovered by Ruby Teaming with varying memory sizes on Llama 2-chat 7B, measured using Llama Guard 2.
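
The iteration described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the archive is modelled as a dictionary keyed by (risk, attack) pairs, `mutate` and `judge` are hypothetical callables standing in for the mutator and the harm-likelihood scorer, the memory is a bounded queue of displaced (prompt, score) pairs, and the memory is assumed to be updated only when a cell's prompt is replaced.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Cell:
    prompt: str
    score: float
    memory: deque  # most recent displaced (prompt, score) first, at most k entries


def ruby_iteration(archive, categories, mutate, judge, k=3, rng=random):
    """Run one sample / mutate / update step; return True if the archive changed.

    archive:    dict mapping (risk, attack) -> Cell (assumed layout)
    categories: list of all (risk, attack) keys
    mutate:     hypothetical callable (parent_prompt, target_key, history) -> str
    judge:      hypothetical callable (prompt) -> float, higher means more harmful
    """
    # Step 1: sample a parent prompt from the archive and a target category.
    parent = archive[rng.choice(list(archive))]
    target_key = rng.choice(categories)
    target = archive.get(target_key)

    # Step 2: mutate the parent, passing the target cell's memory as a cue.
    history = list(target.memory) if target is not None else []
    candidate = mutate(parent.prompt, target_key, history)
    score = judge(candidate)

    # Step 3: fill an empty cell, or replace the occupant if the score improves.
    if target is None:
        archive[target_key] = Cell(candidate, score, deque(maxlen=k))
        return True
    if score > target.score:
        # Memory update: push the previous prompt; deque(maxlen=k) drops the
        # oldest entry automatically once k entries are stored.
        target.memory.appendleft((target.prompt, target.score))
        target.prompt, target.score = candidate, score
        return True
    return False
```

Using `deque(maxlen=k)` keeps the push-and-pop of the memory update in a single `appendleft` call, so the memory never exceeds the chosen depth (the TL;DR reports $k=3$).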
