FERRET: Framework for Expansion Reliant Red Teaming

Ninareh Mehrabi; Vitor Albiero; Maya Pavlova; Joanna Bitton

Abstract

We introduce a multi-faceted automated red teaming framework that generates multi-modal adversarial conversations intended to break a target model, together with several expansions that make these conversations more effective and efficient. The expansions are: (1) horizontal expansion, in which the red team model self-improves and generates more effective conversation starters that shape a conversation; (2) vertical expansion, in which the conversation starters discovered during horizontal expansion are expanded into effective multi-modal conversations; and (3) meta expansion, in which the red team model discovers more effective multi-modal attack strategies over the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red Teaming) and compare it with existing automated red teaming approaches. Our experiments show that FERRET generates effective multi-modal adversarial conversations and outperforms existing state-of-the-art approaches.

Paper Structure (18 sections, 3 figures, 4 tables)

Introduction
FERRET
Framework
Horizontal Expansion
Vertical Expansion
Meta Expansion
Experiments and Results
Main Experiments
Main Results
Single Turn Ablations
Single Turn Ablation Results
Sampling Ablations
Sampling Ablations Results
Human Studies
Human Studies Results
...and 3 more sections

Figures (3)

Figure 1: Overview of the FERRET framework. The framework takes policy descriptions, attack strategies, and few-shot examples as inputs. In horizontal expansion, the attack model self-evolves and explores which conversation starters are more effective by logging them in the horizontal feedback logs. The transformation toolkit takes each attack in XML format and applies the appropriate transformations to create a multi-modal version of the attack. In vertical expansion, the conversation starters are expanded into full conversations. Once a conversation is fully formed, it is saved in the conversation logs and the next conversation starter is generated.
Figure 2: t-SNE plots of the diversity results for the main experiments, comparing the diversity of the attacks generated by each approach for each policy.
Figure 3: t-SNE plots of the diversity results for the ablation study, comparing the diversity of the attacks generated by FERRET and FLIRT in the single-turn setup for each policy.
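
The loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: `FeedbackLog`, `run_ferret_sketch`, and the four callables (`propose_starter`, `continue_attack`, `target`, `score`) are hypothetical names introduced here, the scoring of a conversation is an assumption, and the multi-modal transformation toolkit and meta expansion are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text); a hypothetical conversation format


@dataclass
class FeedbackLog:
    """Hypothetical horizontal feedback log: starters paired with a score."""
    entries: List[Tuple[str, float]] = field(default_factory=list)

    def add(self, starter: str, score: float) -> None:
        self.entries.append((starter, score))

    def top(self, k: int) -> List[str]:
        # Highest-scoring starters first; ties keep insertion order.
        ranked = sorted(self.entries, key=lambda e: e[1], reverse=True)
        return [starter for starter, _ in ranked[:k]]


def run_ferret_sketch(
    propose_starter: Callable[[List[str]], str],
    continue_attack: Callable[[List[Turn]], str],
    target: Callable[[List[Turn]], str],
    score: Callable[[List[Turn]], float],
    few_shot: List[str],
    n_starters: int = 3,
    n_turns: int = 2,
) -> Tuple[FeedbackLog, List[List[Turn]]]:
    log = FeedbackLog()
    conversation_logs: List[List[Turn]] = []
    for _ in range(n_starters):
        # Horizontal expansion: condition the next starter on the best
        # starters logged so far, falling back to the few-shot examples.
        examples = log.top(2) or few_shot
        starter = propose_starter(examples)

        # Vertical expansion: grow the starter into a full conversation.
        conversation: List[Turn] = [("attacker", starter)]
        conversation.append(("target", target(conversation)))
        for _ in range(n_turns - 1):
            conversation.append(("attacker", continue_attack(conversation)))
            conversation.append(("target", target(conversation)))

        # Log the outcome (assumed scoring), save the conversation, and
        # move on to the next conversation starter.
        log.add(starter, score(conversation))
        conversation_logs.append(conversation)
    return log, conversation_logs
```

In this sketch the attack model, target model, and judge are passed in as plain callables, so any of them can be replaced by a stub for testing or by a real model client.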
