Espresso: Robust Concept Filtering in Text-to-Image Models

Anudeep Das, Vasisht Duddu, Rui Zhang, N. Asokan

TL;DR

Espresso tackles three requirements at once: removing unacceptable concepts from diffusion-based text-to-image (T2I) models, preserving utility on acceptable concepts, and resisting adversarial prompts. It does so with a CLIP-based filter that compares the embedding of a generated image against the text embeddings of both unacceptable and acceptable concepts, which restricts adversaries to directions that are harder to exploit. The filter is fine-tuned to recover any utility loss, and it empirically outperforms prior fine-tuning concept removal techniques (CRTs) and filtering methods in effectiveness and robustness across multiple concept groups. A first exploration of certified robustness provides theoretical and empirical bounds on the embedding-space perturbations the filter can tolerate. Overall, Espresso offers a favorable trade-off among suppressing unacceptable content, maintaining quality on acceptable content, and resisting evasion, and it is presented as applicable to a range of T2I models and content domains.

Abstract

Diffusion-based text-to-image models are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe). We need concept removal techniques (CRTs) which are i) effective in preventing the generation of images with unacceptable concepts, ii) utility-preserving on acceptable concepts, and iii) robust against evasion with adversarial prompts. No prior CRT satisfies all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). We identify unacceptable concepts by using the distances from the embedding of a generated image to the text embeddings of both unacceptable and acceptable concepts. This lets us fine-tune for robustness by separating the text embeddings of unacceptable and acceptable concepts while preserving utility. We present a pipeline to evaluate various CRTs and use it to show that Espresso is more effective and robust than prior CRTs, while retaining utility.

Paper Structure (30 sections, 1 theorem, 22 equations, 4 figures, 14 tables)

Key Result

Theorem 1

Let $\hat{x} = \phi_x(x)$ and $\hat{c}^i = \phi_p(c^i)$ for $i \in \{a, u\}$. Define $g_i(\hat{x})$ in terms of the scaled cosine similarity $s(\hat{x}, \hat{c}^i) = \tau \cos(\hat{x}, \hat{c}^i)$; then $g_i$ is the confidence of $\hat{x}$ being classified as $c^i$. $F(x)$ in equation (eq:att) can be defined as $F(\hat{x}) = \arg\max_i g_i(\hat{x})$, and $F(\hat{x})$ classifies $\hat{x}$ as unacceptable if $g_u(\hat{x}) > \Gamma$, where $\Gamma$ is the decision threshold ...
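
The decision rule in Theorem 1 can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the paper: the confidence $g_i$ is computed as a softmax over the two scaled cosine similarities (the usual CLIP-style choice; the exact definition is not shown above), the values of `tau` and `gamma` are placeholders, and small hand-written vectors stand in for the outputs of the CLIP encoders $\phi_x$ and $\phi_p$. The function names are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def confidences(x_hat, c_acc, c_unacc, tau=100.0):
    """Return g_a and g_u for an image embedding x_hat.

    ASSUMPTION: g_i is a softmax over s(x_hat, c_i) = tau * cos(x_hat, c_i).
    tau=100.0 is a placeholder, not a value taken from the paper.
    """
    s = np.array([tau * cosine(x_hat, c_acc), tau * cosine(x_hat, c_unacc)])
    s = s - s.max()  # subtract the max for numerical stability
    e = np.exp(s)
    g = e / e.sum()
    return {"a": float(g[0]), "u": float(g[1])}


def is_unacceptable(x_hat, c_acc, c_unacc, gamma=0.5, tau=100.0):
    """Flag the image if the unacceptable-concept confidence exceeds gamma.

    gamma plays the role of the decision threshold Gamma; 0.5 is a placeholder.
    """
    return confidences(x_hat, c_acc, c_unacc, tau)["u"] > gamma


# Toy stand-ins for CLIP text embeddings of the two concepts (hypothetical).
c_acc = np.array([1.0, 0.0, 0.0])
c_unacc = np.array([0.0, 1.0, 0.0])

# Toy stand-ins for CLIP image embeddings of two generated images.
x_near_unacc = np.array([0.2, 0.9, 0.1])
x_near_acc = np.array([0.9, 0.1, 0.1])

print(is_unacceptable(x_near_unacc, c_acc, c_unacc))
print(is_unacceptable(x_near_acc, c_acc, c_unacc))
```

Because the score depends on the embeddings of both concepts, an image only passes the filter if it moves away from the unacceptable concept and toward the acceptable one, which is the property the summary credits for restricting adversarial directions.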

Figures (4)

  • Figure 1: Overview of pipeline for evaluating CRTs. Prompts, images and arrows for unacceptable (acceptable) in red (green).
  • Figure 2: Espresso is better than other fine-tuning CRTs.
  • Figure 3: Espresso has a better trade-off than UD filter.
  • Figure 4: Certified accuracy of Espresso vs. adversarial noise $\delta$, for a strong $\mathcal{A}dv$ with access to embeddings of generated images.

Theorems & Definitions (2)

  • Theorem 1
  • proof