Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming
Vernon Toh Yan Han, Rishabh Bhardwaj, Soujanya Poria
TL;DR
Ruby Teaming improves automated red teaming of LLMs by adding a memory dimension to the archive used in MAP-Elites-style quality-diversity search. The memory stores the history of mutations and their feedback, guiding the mutator toward higher attack success and greater prompt diversity. On Llama 2-chat 7B, Ruby Teaming achieves an attack success rate (ASR) of 0.74, compared with 0.54 for Rainbow Teaming, and improves Shannon's Evenness Index and Simpson's Diversity Index by 6% and 3%, respectively. A modest memory depth ($k=3$) is enough to substantially boost effectiveness and coverage, though the results are restricted to smaller models and the method carries potential misuse risks.
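To make the memory idea concrete, the following is a minimal sketch of an archive whose cells each keep a bounded history of the most recent mutations and their feedback. It is an illustration under stated assumptions, not the authors' implementation: the class name `MemoryArchive`, the use of a tuple as the cell key, and the string form of the feedback are all hypothetical. Only the memory depth of 3 comes from the summary above.

```python
from collections import deque


class MemoryArchive:
    """Hypothetical archive: each cell remembers its last `depth` entries."""

    def __init__(self, depth=3):
        # depth=3 mirrors the memory depth k=3 mentioned in the summary.
        self.depth = depth
        self._cells = {}

    def record(self, cell, prompt, feedback):
        # Append a (prompt, feedback) pair; the deque silently drops the
        # oldest entry once the cell already holds `depth` items.
        history = self._cells.setdefault(cell, deque(maxlen=self.depth))
        history.append((prompt, feedback))

    def history(self, cell):
        # Return the stored entries oldest-first; empty list for unseen cells.
        return list(self._cells.get(cell, ()))


# Usage with placeholder strings only (the cell key is an assumed descriptor pair).
archive = MemoryArchive(depth=3)
cell = ("category-A", "style-B")
for i in range(1, 5):
    archive.record(cell, f"prompt-{i}", f"feedback-{i}")
recent = archive.history(cell)
```

In a search loop, `recent` would be the context handed to the mutator alongside the current prompt; how that context is formatted is left open here.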
Abstract
We propose Ruby Teaming, a method that improves on Rainbow Teaming by adding a memory cache as a third dimension of the archive. The memory dimension provides cues to the mutator so that it yields better-quality prompts, in terms of both attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20 percentage points higher than the Rainbow Teaming baseline. In terms of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
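For readers unfamiliar with the two diversity metrics, the sketch below computes them from a list of category labels, assuming the standard ecological definitions: SEI as Shannon entropy divided by the log of the number of observed categories, and SDI as one minus the sum of squared category proportions. Whether the paper normalizes by observed or by all possible categories is not stated in the abstract, so that choice, along with the function names, is an assumption.

```python
import math
from collections import Counter


def proportions(labels):
    """Relative frequency of each distinct label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_evenness(labels):
    """SEI = H / ln(S), with H the Shannon entropy and S the observed categories."""
    p = proportions(labels)
    if len(p) < 2:
        return 0.0  # evenness is undefined for one category; 0.0 is a convention
    entropy = -sum(pi * math.log(pi) for pi in p)
    return entropy / math.log(len(p))


def simpson_diversity(labels):
    """SDI = 1 - sum(p_i^2)."""
    return 1.0 - sum(pi * pi for pi in proportions(labels))


# Placeholder labels standing in for the categories of archived prompts.
balanced = ["a", "a", "b", "b"]
skewed = ["a", "a", "a", "b"]
```

Both indices increase as the archive's prompts spread more evenly across categories, so `balanced` scores higher than `skewed` on each.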
