Espresso: Robust Concept Filtering in Text-to-Image Models
Anudeep Das, Vasisht Duddu, Rui Zhang, N. Asokan
TL;DR
Espresso addresses three requirements for diffusion-based text-to-image (T2I) models at once: eliminating unacceptable concepts, preserving utility on acceptable concepts, and resisting adversarial prompts. It uses a CLIP-based filter that considers distances to both unacceptable and acceptable concept embeddings, which restricts adversaries to directions that are harder to exploit. The method is augmented with targeted fine-tuning to recover utility loss, and experiments show that it outperforms prior fine-tuning-based concept removal techniques (CRTs) and filtering methods in effectiveness and robustness across multiple concept groups. A first exploration of certified robustness provides theoretical and empirical bounds on the encoder-space perturbations the filter can tolerate, and the results suggest Espresso is a robust, adaptable option for real-world T2I safety. Overall, Espresso demonstrates a favorable trade-off among reducing unacceptable content, maintaining quality on acceptable content, and resisting evasion strategies, with applicability to various T2I models and content domains.
Abstract
Diffusion-based text-to-image models are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe content). We need concept removal techniques (CRTs) that are i) effective in preventing the generation of images with unacceptable concepts, ii) utility-preserving on acceptable concepts, and iii) robust against evasion with adversarial prompts. No prior CRT satisfies all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). We identify unacceptable concepts by using the distances from the embedding of a generated image to the text embeddings of both unacceptable and acceptable concepts. This lets us fine-tune for robustness by separating the text embeddings of unacceptable and acceptable concepts while preserving utility. We present a pipeline to evaluate various CRTs and use it to show that Espresso is more effective and robust than prior CRTs, while retaining utility.
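The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image and text embeddings have already been computed by a CLIP-style encoder, it assumes cosine distance as the distance measure, and the names `cosine_distance` and `is_unacceptable` are hypothetical. An image is flagged when its embedding lies closer to the unacceptable concept than to the acceptable one.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two vectors (assumed metric, not confirmed by the source)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_unacceptable(image_emb, unacceptable_text_emb, acceptable_text_emb):
    """Flag an image whose embedding is closer to the unacceptable concept.

    All three inputs are assumed to be precomputed embeddings that live in
    the same joint image-text space (hypothetical helper for illustration).
    """
    d_unacceptable = cosine_distance(image_emb, unacceptable_text_emb)
    d_acceptable = cosine_distance(image_emb, acceptable_text_emb)
    return d_unacceptable < d_acceptable


# Toy 3-dimensional vectors standing in for real embeddings.
unacceptable_concept = np.array([1.0, 0.0, 0.0])
acceptable_concept = np.array([0.0, 1.0, 0.0])

image_near_unacceptable = np.array([0.9, 0.1, 0.0])
image_near_acceptable = np.array([0.1, 0.9, 0.0])

print(is_unacceptable(image_near_unacceptable, unacceptable_concept, acceptable_concept))
print(is_unacceptable(image_near_acceptable, unacceptable_concept, acceptable_concept))
```

Comparing against both concepts, rather than thresholding the distance to the unacceptable concept alone, is what the abstract describes as using "both unacceptable and acceptable concepts"; the simple comparison rule used here is only one plausible way to combine the two distances.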
