Robustness in Both Domains: CLIP Needs a Robust Text Encoder
Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
TL;DR
The paper tackles CLIP's vulnerability to adversarial text perturbations by introducing LEAF, a fast and scalable text-encoder adversarial finetuning method based on Levenshtein-distance constraints. LEAF achieves substantial improvements in zero-shot text robustness, enhances text-to-image generation quality under adversarial noise, and improves multimodal retrieval, all with minimal loss in clean performance. When paired with robust image encoders like FARE, LEAF enables bimodal robustness across text and image domains, and the approach generalizes to embedding inversion, offering more interpretable embeddings. The work provides open-source code and models, highlighting practical impact for deploying more reliable vision-language systems in safety-critical settings, while acknowledging limitations such as isolated encoder finetuning and lack of token-level defense evaluation.
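To make the Levenshtein-distance constraint concrete, here is a minimal sketch of how such a character-level perturbation budget could be checked. It is an illustration only, not the LEAF training or attack code: the function names `levenshtein` and `within_budget` and the budget parameter `k` are hypothetical, and the dynamic-programming routine is the textbook edit-distance algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def within_budget(clean: str, perturbed: str, k: int) -> bool:
    """Hypothetical helper: True if `perturbed` lies within edit distance `k`
    of `clean`, i.e. it is an admissible perturbation under budget `k`."""
    return levenshtein(clean, perturbed) <= k
```

For example, `within_budget("a photo of a dog", "a photo of a dig", 1)` holds because a single substitution separates the two captions, whereas swapping in an unrelated word would typically exceed a small budget.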
Abstract
Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we close this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization. We open-source our code ( https://github.com/LIONS-EPFL/LEAF ) and models ( https://huggingface.co/LEAF-CLIP ).
