Query-Efficient Black-Box Red Teaming via Bayesian Optimization

Deokjae Lee, JunYeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, Hyun Oh Song

TL;DR

This work addresses query-efficient black-box red teaming of large generative models. It introduces Bayesian Red Teaming (BRT), which uses Bayesian optimization to sequentially select, or select and edit, test inputs from a predefined user input pool. BRT combines a Gaussian process (GP) surrogate for the red-team score with a white-box diversity objective, so that the search yields many offensive yet diverse test cases under a fixed query budget. The approach demonstrates substantial improvements over baselines across open-domain dialogue, prompt continuation, and text-to-image generation, including strong results in hard-positive and human-evaluation settings. The method applies to multiple victim-model families, and code is available for replication and further safety research.

Abstract

The deployment of large-scale generative models is often restricted by their potential risk of causing harm to users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures with limited query access. Existing red teaming methods construct test cases from human supervision or a language model (LM) and query all test cases in a brute-force manner without incorporating any information from past evaluations, resulting in a prohibitively large number of queries. To this end, we propose Bayesian red teaming (BRT), a family of novel query-efficient black-box red teaming methods based on Bayesian optimization, which iteratively identify diverse positive test cases leading to model failures by utilizing a pre-defined user input pool and the past evaluations. Experimental results on various user input pools demonstrate that our method consistently finds a significantly larger number of diverse positive test cases under the limited query budget than the baseline methods. The source code is available at https://github.com/snu-mllab/Bayesian-Red-Teaming.
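To make the idea of reusing past evaluations concrete, the following is a minimal sketch of a pool-based Bayesian optimization loop. It is not the authors' implementation (see the linked repository for that). The names `red_team_loop`, `pool_features`, and `query_victim`, the RBF kernel, the upper-confidence-bound acquisition, and the similarity-based redundancy penalty weighted by `lam` are all illustrative assumptions standing in for the paper's surrogate and diversity objective.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between the rows of a and the rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-3):
    # Standard GP regression posterior with a constant (empirical) prior mean.
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - y_mean))
    mean = y_mean + k_ts.T @ alpha
    v = np.linalg.solve(chol, k_ts)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance of the RBF kernel is 1
    return mean, np.sqrt(np.maximum(var, 1e-12))

def red_team_loop(pool_features, query_victim, budget, lam=0.3, beta=1.0, n_init=5, seed=0):
    """Select `budget` pool indices, one victim query per selected index.

    pool_features: (n, d) array of hypothetical embeddings of the user input pool.
    query_victim:  hypothetical black-box callable, index -> red-team score.
    lam:           assumed weight of the redundancy (diversity) penalty.
    """
    rng = np.random.default_rng(seed)
    n = len(pool_features)
    # Start from a few random queries so the surrogate has data to fit.
    queried = [int(i) for i in rng.choice(n, size=n_init, replace=False)]
    scores = [query_victim(i) for i in queried]
    while len(queried) < budget:
        x_train = pool_features[queried]
        mean, std = gp_posterior(x_train, np.asarray(scores), pool_features)
        # Penalize candidates that are close to inputs already queried.
        redundancy = rbf_kernel(pool_features, x_train).max(axis=1)
        acquisition = mean + beta * std - lam * redundancy
        acquisition[queried] = -np.inf  # never spend a query on a repeat
        nxt = int(np.argmax(acquisition))
        queried.append(nxt)
        scores.append(query_victim(nxt))
    return queried, scores
```

In this sketch every victim query updates the surrogate before the next input is chosen, which is the contrast with brute-force querying that the abstract draws. A caller would typically count positives afterwards by thresholding `scores`; the threshold itself is another assumption not fixed here.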

Paper Structure

This paper contains 54 sections, 21 equations, 5 figures, 13 tables, and 3 algorithms.

Figures (5)

  • Figure 1: Illustration of edit-based BRT. Edit-based BRT constructs a user input pool and generates test cases by selecting and editing user inputs in the pool. Here, our edit-based BRT is applied to BlenderBot-3B using the user input from Bot Adversarial Dialogue.
  • Figure 2: Cumulative number of positive test cases discovered by red teaming methods on the Bloom ZS user input pool against the BB-3B model. The dashed lines denote the search-based red teaming methods.
  • Figure 3: Examples of the original (solid line box) and edited test cases (dashed line box) discovered by hard positive red teaming with BRT (e) on various user input pools against BB-3B and GODEL-large.
  • Figure 4: Red teaming results on OPT-66B ZS user input pool under a query limit of $N=20{,}000$. For $\text{BRT}^{\text{fix}}$ (e+r), we vary $\lambda$ in the range of $\{0,0.05,0.1,0.2,0.3,0.4,0.6,1.0\}$. For BRT (e+r), we use the diversity budget $D\in\{40.0,43.0\}$.
  • Figure 5: Red teaming results on Empathetic Dialogues under a query limit of $N=20{,}000$. We fix $\lambda=0.3$ and vary $\eta$ in the range of $\{0,0.003,0.01,0.03,0.1\}$.

Theorems & Definitions (1)

  • Definition 1