Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher

TL;DR

The paper tackles CLIP's vulnerability to adversarial text perturbations by introducing LEAF, a fast and scalable text-encoder adversarial finetuning method based on Levenshtein-distance constraints. LEAF achieves substantial improvements in zero-shot text robustness, enhances text-to-image generation quality under adversarial noise, and improves multimodal retrieval, all with minimal loss in clean performance. When paired with robust image encoders like FARE, LEAF enables bimodal robustness across text and image domains, and the approach generalizes to embedding inversion, offering more interpretable embeddings. The work provides open-source code and models, highlighting practical impact for deploying more reliable vision-language systems in safety-critical settings, while acknowledging limitations such as isolated encoder finetuning and lack of token-level defense evaluation.

Abstract

Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we fill this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization. We open-source our code ( https://github.com/LIONS-EPFL/LEAF ) and models ( https://huggingface.co/LEAF-CLIP ).
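
The TL;DR states that the text attacks are constrained by the Levenshtein distance. As a minimal sketch of what such a constraint checks, the following computes the character-level edit distance with the standard dynamic-programming recurrence. The function names `levenshtein` and `within_budget` and the budget parameter `k` are illustrative assumptions and are not taken from the released LEAF code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def within_budget(clean: str, perturbed: str, k: int) -> bool:
    """Hypothetical helper: is `perturbed` at most k character edits from `clean`?"""
    return levenshtein(clean, perturbed) <= k
```

For instance, `within_budget("This is an image of a pyramid", "This is an image of a pyramjd", 1)` holds, because a single character substitution separates the two sentences.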

Paper Structure

This paper contains 39 sections, 10 equations, 12 figures, 28 tables, 2 algorithms.

Figures (12)

  • Figure 1: Left: our idea. Schlarmann et al. (2024) propose FARE: finetuning the CLIP image encoder to produce embeddings close to the clean image embedding (★) under image perturbations. Analogously, we finetune the CLIP text encoder to produce embeddings close to the clean text embedding (★) under text perturbations. Right: results for ViT-L/14. The first (second) ✗/✓ denotes the usage of a robust image (text) encoder. We constrain the text attacks with the Levenshtein distance and the image attacks in the $\ell_{\infty}$ norm. By combining the FARE robust image encoder with our robust text encoder, we obtain high adversarial accuracy in both domains.
  • Figure 2: Schematic and example of the attack used in LEAF: In the first step, we randomly select $\rho=6$ positions, replace each of them in turn with a whitespace, and keep the position with the highest loss. Next, we randomly select $\rho$ characters from $\Gamma$, substitute each of them at the chosen position, and keep the one with the highest loss as the final perturbation. During training, the attack evaluates $\rho \times B$ sentences in every forward pass, where $B$ is the batch size. For more details, see Algorithm \ref{alg:parallel_charmer} in the appendix.
  • Figure 3: Training hyperparameter effects: We report the zero-shot clean and adversarial accuracy in the image (ImageNet) and text (AG-News) domains with FARE as a baseline. When no semantic constraints are employed (Section \ref{subsec:back_robustness_text}), the robustness in the text domain is improved at the cost of significantly degrading the image domain performance. Adding semantic constraints improves the robustness in the text domain with minimal effects on the image domain. Using random perturbations ($\rho=1$) improves the AG-News adversarial accuracy by $9.9$ points, with stronger attacks ($\rho=50$) providing the best performance with $18.7$ points of improvement.
  • Figure 4: Larger perturbations: We evaluate the adversarial accuracy on AG-News for $k \in \{1,2,3,4,5\}$ at the ViT-L/14 scale. Our model (LEAF) obtains the highest adversarial accuracy at all values of the distance bound $k$.
  • Figure 5: Visualizing MS-COCO retrieved images. For our ViT-L/14 robust model and its non-robust counterpart, we show the top-3 retrieved images for the original Query and for the Query perturbed via the Charmer attack ($k=2$, $n=10$). The robust model is able to preserve the order and retrieves semantically relevant images even for the perturbed query. More illustrations can be found in Section \ref{subsec:app_retrieval}. The target query in this case was "This is an image of a pyramid".
  • ...and 7 more figures
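
The two-step attack described in the Figure 2 caption can be sketched roughly as follows. This is a simplified, unbatched illustration and not the authors' implementation: the function name `two_step_char_attack`, the `loss_fn` callable, the default character set standing in for $\Gamma$, and sampling without replacement are all assumptions made for the example. It only performs in-place character replacement and omits the semantic constraints and the batching over $\rho \times B$ sentences mentioned in the captions.

```python
import random
import string
from typing import Callable, Optional


def replace_at(sentence: str, index: int, char: str) -> str:
    """Return a copy of `sentence` with the character at `index` replaced."""
    return sentence[:index] + char + sentence[index + 1:]


def two_step_char_attack(
    sentence: str,
    loss_fn: Callable[[str], float],  # hypothetical: higher means a worse embedding
    rho: int = 6,
    charset: str = string.ascii_lowercase + " ",  # stand-in for the set Gamma
    rng: Optional[random.Random] = None,
) -> str:
    """One character-level perturbation chosen in two steps (positions, then characters)."""
    rng = rng or random.Random()
    if not sentence:
        return sentence

    # Step 1: sample rho positions, blank each one out, keep the highest-loss position.
    positions = rng.sample(range(len(sentence)), k=min(rho, len(sentence)))
    best_pos = max(positions, key=lambda i: loss_fn(replace_at(sentence, i, " ")))

    # Step 2: sample rho characters, try each at that position, keep the highest-loss one.
    chars = rng.sample(list(charset), k=min(rho, len(charset)))
    candidates = [replace_at(sentence, best_pos, c) for c in chars]
    return max(candidates, key=loss_fn)


# Usage with a toy loss that counts characters differing from the clean sentence.
# In LEAF the loss would instead compare text-encoder embeddings.
clean = "This is an image of a pyramid"
toy_loss = lambda s: sum(x != y for x, y in zip(s, clean))
adversarial = two_step_char_attack(clean, toy_loss, rho=6, rng=random.Random(0))
```

Each call evaluates at most $2\rho$ candidate sentences, which is why the search stays cheap compared to scoring every position and every character.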

Theorems & Definitions (3)

  • Definition B.1: Expansion and contraction operators
  • Example B.2
  • Definition B.3: Replacement operator