LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"

Som Sagar, Aditya Taparia, Ransalu Senanayake

TL;DR

This paper improves the "Failures are fated, but can be faded" framework (arXiv:2406.07145)--a post-hoc method to explore and construct the failure landscape in pre-trained generative models--with a variety of deep reinforcement learning algorithms, screening tests, and LLM-based rewards and state generation, and demonstrates how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes.

Abstract

In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values, among others. Therefore, before deploying these models, it is crucial to characterize this failure landscape for engineers to debug or audit models. Nevertheless, it is infeasible to exhaustively test for all possible combinations of factors that could lead to a model's failure. In this paper, we improve the "Failures are fated, but can be faded" framework (arXiv:2406.07145)--a post-hoc method to explore and construct the failure landscape in pre-trained generative models--with a variety of deep reinforcement learning algorithms, screening tests, and LLM-based rewards and state generation. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes. We empirically demonstrate the effectiveness of the proposed method on diffusion models. We also highlight the strengths and weaknesses of each algorithm in identifying failure modes.

Paper Structure

This paper contains 16 sections, 4 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: There are three main steps in the proposed failure discovery and mitigation framework. 1. Discover: We propose a deep RL-based method to explore the failure landscape with microscopic and macroscopic exploration strategies. It will discover regions where the model works and fails, with varying levels of confidence. 2. Summarize: Results are qualitatively and quantitatively summarized for the user to indicate preferences. 3. Restructure: Based on the user's preferences from the previous stage, the model can be fine-tuned to mitigate or shift away the failure modes to unlikely regions. The center image shows images generated by Stable Diffusion v1-4 for the prompt "Create an image of a distinctive <artist> analyzing data on a computer in a <research center>". A user selects the most likely failure in terms of image quality from the summary report. The fine-tuned model, based on user preferences, has generated more naturalistic images.
  • Figure 2: Word cloud of prompts from a) predefined states and b) LLM-generated states
  • Figure 3: LLM Reward Function
  • Figure 4: a) A visualization of the failure landscape. b) We can observe sample failures and obtain quantitative distances. We see a shift in the failure mode (yellow) after fine-tuning.
  • Figure 5: Failure mode shift in DQN, PPO and A2C
  • ...and 7 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2: Reduced Failures