What do we learn from inverting CLIP models?

Hamid Kazemi; Atoosa Chegini; Jonas Geiping; Soheil Feizi; Tom Goldstein

TL;DR

This work uses CLIP inversion to probe what CLIP embeddings encode: it generates an image $x$ whose embedding aligns with a text prompt $p$ via the objective $\max_{x} \cos\big(V(A(x)), T(p)\big) + \mathrm{Reg}(x)$, where $V$ and $T$ are CLIP's image and text encoders, $A$ denotes augmentations, and the regularizer is $\mathrm{Reg}(x) = \alpha\,\mathrm{TV}(x) + \beta \|x\|_1$. The authors show that CLIP inversions can blend concepts, reveal biases (notably gender bias and associations with NSFW content, especially for certain celebrity prompts), and improve in quality as the training data scale grows. They demonstrate that benign prompts can yield NSFW imagery and that neutral prompts can skew toward a particular gender, raising concerns about the use of CLIP embeddings in text-to-image systems. The findings underscore the importance of careful data curation and content filtering in CLIP-based pipelines and highlight potential safety and fairness issues in multimodal representations derived from web-scale data.
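The objective in the TL;DR can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical random linear map standing in for CLIP's image encoder, `identity_augment` stands in for the augmentations, the weights `alpha` and `beta` are arbitrary, and the regularizers are assumed to act as penalties, so the sketch minimizes the negative cosine similarity plus $\alpha\,\mathrm{TV}(x) + \beta\|x\|_1$.

```python
import numpy as np

def total_variation(x):
    """Anisotropic total variation of an image array shaped (H, W, C)."""
    dh = np.abs(x[1:, :, :] - x[:-1, :, :]).sum()  # vertical neighbor differences
    dw = np.abs(x[:, 1:, :] - x[:, :-1, :]).sum()  # horizontal neighbor differences
    return float(dh + dw)

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def inversion_loss(x, text_emb, image_encoder, augment, alpha, beta):
    """Loss to minimize: -cos(V(A(x)), T(p)) + alpha * TV(x) + beta * ||x||_1.

    Assumption: the regularizers are penalties, hence the sign flip on the
    similarity term relative to the 'max' form of the objective.
    """
    sim = cosine(image_encoder(augment(x)), text_emb)
    reg = alpha * total_variation(x) + beta * float(np.abs(x).sum())
    return -sim + reg

# Hypothetical stand-ins for the real CLIP components (assumptions, not CLIP).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8 * 8 * 3))

def toy_encoder(img):
    """Random linear projection playing the role of the image encoder V."""
    return W @ img.reshape(-1)

def identity_augment(img):
    """Placeholder for the augmentation A."""
    return img

x = rng.standard_normal((8, 8, 3))   # candidate image being optimized
text_emb = rng.standard_normal(16)   # placeholder for T(p)
loss = inversion_loss(x, text_emb, toy_encoder, identity_augment,
                      alpha=0.01, beta=0.01)
```

In a real setting the image would be updated by gradient descent on this loss through an automatic-differentiation framework; the sketch only evaluates the loss for a single candidate image.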

Abstract

We employ an inversion-based approach to examine CLIP models. Our examination reveals that inverting CLIP models results in the generation of images that exhibit semantic alignment with the specified target prompts. We leverage these inverted images to gain insights into various aspects of CLIP models, such as their ability to blend concepts and inclusion of gender biases. We notably observe instances of NSFW (Not Safe For Work) images during model inversion. This phenomenon occurs even for semantically innocuous prompts, like "a beautiful landscape," as well as for prompts involving the names of celebrities.

Paper Structure (17 sections, 4 equations, 13 figures, 5 tables)

Introduction
Related Work
Class Inversion
CLIP Visualization
Bias and NSFW content
Method
Analysis
Blending Concepts
NSFW Content Analysis
Gender Biases
Effect of Training Data Scale
Bag of Words
Experimental Details
Reproducibility
Discussion and Limitations
...and 2 more sections

Figures (13)

Figure 1: Prompts: "Floating castle held by balloons in the sky," "Panda mad scientist mixing sparkling chemicals," "Johnny Depp," "An astronaut exploring an alien planet, discovering a mysterious ancient artifact," "A mechanic in the busy auto repair shop," "A shiba inu wearing a beret and black turtleneck," "Enchanted forest with watching tree eyes," "A bustling market in a bustling city, showcasing diverse cultures and exotic goods," "Wizard tortoise in hat and robes, casting spells," "An excited crowd," "A post-apocalyptic wasteland with a lone survivor traversing the desolate terrain," "The self concept," "A snail made of harp. a snail with the texture of a harp," "A girl reading a book," "A worried person."
Figure 2: Progression of inverted images for the prompts "A peaceful sunset," "Professor Albus Dumbledore," and "A loving couple." We start at resolution 64 and increase it to 128 and 224 at iterations 900 and 1800, respectively.
Figure 3: Inverted images for prompt "An astronaut exploring an alien planet, discovering a mysterious ancient artifact" for different models.
Figure 4: Inverting the prompt "A person jumping in a park."
Figure 5: Inverted images of certain celebrity names lead to NSFW imagery.
...and 8 more figures
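The resolution schedule described in the Figure 2 caption (64, then 128 at iteration 900, then 224 at iteration 1800) can be sketched as below. The function names are hypothetical, iterations are assumed to be counted from 0, and the nearest-neighbor resize is an assumption chosen for simplicity; the caption does not say how the image is carried over between resolutions.

```python
import numpy as np

# (start_iteration, resolution) pairs taken from the Figure 2 caption.
SCHEDULE = ((0, 64), (900, 128), (1800, 224))

def resolution_at(iteration, schedule=SCHEDULE):
    """Return the image resolution in effect at a given iteration."""
    res = schedule[0][1]
    for start, r in schedule:
        if iteration >= start:
            res = r
    return res

def resize_nearest(x, size):
    """Nearest-neighbor resize of an (H, W, C) array to (size, size, C).

    Assumption: a simple stand-in for whatever interpolation is really used.
    """
    h, w = x.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return x[rows][:, cols]

# Walk through the schedule, resizing the image whenever the resolution changes.
image = np.zeros((resolution_at(0), resolution_at(0), 3))
for it in range(2000):
    target = resolution_at(it)
    if image.shape[0] != target:
        image = resize_nearest(image, target)
    # ... one optimization step on `image` would go here ...
```

Starting at a low resolution and growing the image lets early iterations settle coarse structure before finer detail is optimized.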
