Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek

TL;DR

This work analyzes how CLIP vision encoders behave under typographic attacks, and introduces Dyslexify - a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads.

Abstract

Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce Dyslexify - a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, Dyslexify improves performance by up to 22.06% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%, and we demonstrate its utility in a medical foundation model for skin lesion diagnosis. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
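
The abstract does not say how a head is ablated in practice, so the following is only a minimal sketch of one common choice, zero-ablation: the output of each selected head is replaced by zeros before the per-head outputs are concatenated. The function name `ablate_heads` and the toy vectors are hypothetical and not taken from the paper; the actual intervention may differ (for example, it may act on the attention weights rather than on the head outputs).

```python
from typing import List, Set

Vector = List[float]


def ablate_heads(head_outputs: List[Vector], ablated: Set[int]) -> Vector:
    """Concatenate per-head outputs, zeroing the heads listed in `ablated`.

    Assumption: zero-ablation. In a real ViT block the concatenated vector
    would next be multiplied by the attention output projection.
    """
    merged: Vector = []
    for head_idx, out in enumerate(head_outputs):
        if head_idx in ablated:
            # Ablated head contributes nothing to the residual stream.
            merged.extend([0.0] * len(out))
        else:
            merged.extend(out)
    return merged


# Toy example: three heads with two output dimensions each; head 1 is ablated.
heads = [[0.5, -1.0], [2.0, 3.0], [0.1, 0.2]]
print(ablate_heads(heads, ablated={1}))  # [0.5, -1.0, 0.0, 0.0, 0.1, 0.2]
```

Because the intervention only masks existing activations, it needs no gradient step, which matches the training-free framing above.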

Paper Structure

This paper contains 28 sections, 9 equations, 15 figures, 9 tables, 1 algorithm.

Figures (15)

  • Figure 1: Defending CLIP against typographic attacks with Dyslexify. a) Adversarial text in images can dominate CLIP’s representation and lead to misclassification. b) We construct a circuit of attention heads responsible for transmitting typographic information. c) By suppressing the typographic circuit, we defend against typographic attacks without a single gradient step.
  • Figure 2: Investigating where typographic understanding emerges in CLIP. a) We train two linear probes on all layers of CLIP models. Probe $P_{\text{img}, \ell}$ is used to predict the text label of each sample, while $P_{\text{typo}, \ell}$ is trained to predict the typographic class. b) $P_{\text{typo}, \ell}$ shows a consistent pattern across all model sizes: typographic information emerges abruptly in the second half of the model's layers. c) This trend does not hold for the object probes $P_{\text{img}, \ell}$: object-specific information builds gradually over the layers. Each line in the shaded area represents one CLIP model. d) While attention layers seem to add linearly decodable information to the cls token, MLP layers remove or obscure information.
  • Figure 3: Analysis of the Typographic Attention Score. a) For each head in the model, we calculate the Typographic Attention Score $T_{i,\ell}$, utilizing the spatial bias in the Unsplash-typo dataset. b) Depiction of ViT-B's $T_{i,\ell}$ scores. While most attention heads do not show any spatial bias in their attention patterns, a few attention heads show significantly elevated scores, with $T_{i,\ell} \geq \mu(T) + 2\sigma(T)$. Those heads occur only in the second half of the model's layers. c) Overlaying the linear probes with the significantly elevated $T_{i,\ell}$ scores highlights an interesting correlation: only after the model has passed the attention heads with exceptionally high $T_{i,\ell}$ scores does the accuracy of $P_{\text{typo}, \ell}$ begin to increase rapidly.
  • Figure 4: Tradeoff between general accuracy and typographic robustness as a function of the number of ablated heads. Ablations are applied in decreasing order of $T_{i,\ell}$.
  • Figure 5: Controlling typographic vulnerability by manipulating attention sinks in circuit heads. a) We set the cls token attention to $\alpha$ and rescale the spatial token attentions to sum to $1-\alpha$. b) Increasing $\alpha$ raises attention to spatial tokens, amplifying typographic understanding. c) Decreasing $\alpha$ increases typographic robustness, increasing the probability of predicting the true object class.
  • ...and 10 more figures
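
Two operations named in the captions can be sketched in a few lines: the selection rule $T_{i,\ell} \geq \mu(T) + 2\sigma(T)$ from Figure 3, with heads ordered by decreasing score as in Figure 4, and the attention rescaling from Figure 5 a). This is a hedged illustration, not the authors' implementation. The function names, the toy scores, the use of the population standard deviation, and the convention that index 0 of an attention row is the cls token are all assumptions made for this sketch.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical convention


def select_circuit_heads(scores: Dict[Head, float], k: float = 2.0) -> List[Head]:
    """Return heads with score >= mean + k * std, highest score first.

    Assumption: population standard deviation over all heads of the model.
    """
    values = list(scores.values())
    threshold = mean(values) + k * pstdev(values)
    selected = [head for head, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda head: scores[head], reverse=True)


def rescale_attention(attn: List[float], alpha: float) -> List[float]:
    """Set the cls attention to alpha and rescale the rest to sum to 1 - alpha.

    Assumption: attn[0] is the weight on the cls token and attn[1:] are the
    weights on the spatial tokens of one attention row.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    spatial = attn[1:]
    spatial_mass = sum(spatial)
    if spatial_mass <= 0.0:
        raise ValueError("spatial tokens carry no attention mass to rescale")
    scale = (1.0 - alpha) / spatial_mass
    # Relative proportions among spatial tokens are preserved.
    return [alpha] + [weight * scale for weight in spatial]


# Toy scores: nine unremarkable heads and one clear outlier at layer 9, head 3.
toy_scores: Dict[Head, float] = {(layer, 0): 0.1 for layer in range(9)}
toy_scores[(9, 3)] = 0.9
print(select_circuit_heads(toy_scores))  # [(9, 3)]

# One attention row over [cls, patch_1, patch_2].
print(rescale_attention([0.2, 0.4, 0.4], alpha=0.5))
```

Under the definition in Figure 5 a), the rescaled row still sums to 1, so the intervention changes only how attention mass is split between the cls token and the spatial tokens.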