What do we learn from inverting CLIP models?
Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein
TL;DR
This work uses CLIP inversion to probe what CLIP embeddings encode: it generates images whose embeddings align with a textual prompt via the objective $\max_{x} \cos\big(V(A(x)), T(p)\big) + \text{Reg}(x)$, where $V$ and $T$ are CLIP's image and text encoders, $A$ is a random augmentation, $p$ is the prompt, and the regularizer is $\text{Reg}(x) = \alpha \text{TV}(x) + \beta \|x\|_1$. The authors show that CLIP inversions can blend concepts, that they reveal biases (notably gender bias and associations with NSFW content, especially for certain celebrity prompts), and that inversion quality improves as the scale of the training data grows. They demonstrate that benign prompts can yield NSFW imagery and that gender-neutral prompts can skew toward a particular gender, raising concerns about the use of these embeddings in text-to-image systems. The findings underscore the importance of careful data curation and content filtering in CLIP-based pipelines and highlight potential safety and fairness issues in multimodal representations derived from web-scale data.
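To make the objective concrete, the following is a minimal NumPy sketch of how the score for a candidate image could be evaluated. It is not the authors' implementation. The names `inversion_score`, `image_encoder`, and `augment` are hypothetical, the encoder and augmentation are passed in as plain callables rather than taken from any CLIP library, and the default weights are placeholders. The sketch also assumes that the regularizer acts as a penalty, so it is subtracted from the cosine similarity being maximized; this sign convention is an assumption of the sketch.

```python
import numpy as np


def total_variation(x):
    """Anisotropic total variation of an image with shape (H, W, C):
    the sum of absolute differences between vertically and
    horizontally adjacent pixels."""
    dh = np.abs(x[1:, :, :] - x[:-1, :, :]).sum()
    dw = np.abs(x[:, 1:, :] - x[:, :-1, :]).sum()
    return float(dh + dw)


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def regularizer(x, alpha, beta):
    """Reg(x) = alpha * TV(x) + beta * ||x||_1."""
    return alpha * total_variation(x) + beta * float(np.abs(x).sum())


def inversion_score(x, text_emb, image_encoder, augment,
                    alpha=1e-3, beta=1e-3):
    """Score of a candidate image x for a target text embedding.

    image_encoder and augment are hypothetical callables standing in
    for V and A; text_emb stands in for T(p).
    Assumption: the regularizer is a penalty, so it is subtracted
    from the similarity term that is being maximized.
    """
    sim = cosine(image_encoder(augment(x)), text_emb)
    return sim - regularizer(x, alpha, beta)
```

In an actual inversion, `x` would be updated by gradient ascent on this score through a differentiable encoder, with a fresh augmentation sampled at each step; the sketch only evaluates the score for a fixed `x`.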
Abstract
We employ an inversion-based approach to examine CLIP models. Our examination reveals that inverting CLIP models results in the generation of images that exhibit semantic alignment with the specified target prompts. We leverage these inverted images to gain insights into various aspects of CLIP models, such as their ability to blend concepts and inclusion of gender biases. We notably observe instances of NSFW (Not Safe For Work) images during model inversion. This phenomenon occurs even for semantically innocuous prompts, like "a beautiful landscape," as well as for prompts involving the names of celebrities.
