On the Adversarial Robustness of Discrete Image Tokenizers

Rishika Bhagwatkar, Irina Rish, Nicolas Flammarion, Francesco Croce

TL;DR

This work fine-tunes popular tokenizers with unsupervised adversarial training, which improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data.

Abstract

Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. As the first work to study this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.
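
To make the attack idea in the abstract concrete, the following is a minimal sketch of a label-free attack that perturbs the pre-quantization features of a tokenizer so that the extracted tokens may change. Everything in it is an assumption made for illustration: the encoder is a toy linear map `W`, the codebook is random, the optimizer is plain signed-gradient PGD rather than the APGD used in the paper, and all function names are hypothetical. It is not the authors' implementation.

```python
import random


def encode(W, patch):
    """Toy linear encoder (assumption): feature vector W @ patch."""
    return [sum(w * p for w, p in zip(row, patch)) for row in W]


def quantize(codebook, z):
    """Token id = index of the nearest codebook vector (squared L2)."""
    dists = [sum((c - v) ** 2 for c, v in zip(code, z)) for code in codebook]
    return dists.index(min(dists))


def split(image, patch_size):
    """Cut a flat image into consecutive patches (length must divide evenly)."""
    return [image[i:i + patch_size] for i in range(0, len(image), patch_size)]


def tokenize(W, codebook, image, patch_size):
    """Map every patch to a discrete token id."""
    return [quantize(codebook, encode(W, p)) for p in split(image, patch_size)]


def feature_shift(W, clean, adv, patch_size):
    """Sum over patches of ||W @ adv_patch - W @ clean_patch||^2."""
    total = 0.0
    for pc, pa in zip(split(clean, patch_size), split(adv, patch_size)):
        total += sum((a - c) ** 2 for a, c in zip(encode(W, pa), encode(W, pc)))
    return total


def feature_attack(W, image, patch_size, eps, steps=50, seed=0):
    """Plain PGD (not APGD) maximizing the feature shift in an L-inf ball."""
    rng = random.Random(seed)
    step = eps / 4
    # Random start: the gradient of the squared shift is zero at delta = 0.
    adv = [min(1.0, max(0.0, x + rng.uniform(-eps, eps))) for x in image]
    for _ in range(steps):
        grad = []
        for pc, pa in zip(split(image, patch_size), split(adv, patch_size)):
            delta = [a - c for a, c in zip(pa, pc)]
            dz = encode(W, delta)  # W @ delta, valid because the encoder is linear
            # d/d(delta) ||W delta||^2 = 2 W^T (W delta)
            grad += [2 * sum(row[j] * d for row, d in zip(W, dz))
                     for j in range(patch_size)]
        for i, g in enumerate(grad):
            x = adv[i] + step * ((g > 0) - (g < 0))          # signed ascent step
            x = min(image[i] + eps, max(image[i] - eps, x))  # project onto eps-ball
            adv[i] = min(1.0, max(0.0, x))                   # keep valid pixel range
    return adv


if __name__ == "__main__":
    rng = random.Random(1)
    patch_size, feat_dim, vocab = 4, 3, 16
    W = [[rng.gauss(0, 1) for _ in range(patch_size)] for _ in range(feat_dim)]
    codebook = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(vocab)]
    image = [rng.random() for _ in range(8 * patch_size)]
    adv = feature_attack(W, image, patch_size, eps=8 / 255)
    clean_tokens = tokenize(W, codebook, image, patch_size)
    adv_tokens = tokenize(W, codebook, adv, patch_size)
    changed = sum(a != b for a, b in zip(clean_tokens, adv_tokens))
    print(f"{changed}/{len(clean_tokens)} tokens changed")
```

The attack never looks at labels or at any downstream model, which is the sense in which such an attack is unsupervised and application-agnostic. Whether any token actually flips depends on how close the clean features sit to a codebook decision boundary, so the demo prints the count rather than assuming one.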

Paper Structure (17 sections, 2 equations, 5 figures, 10 tables)

This paper contains 17 sections, 2 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Unsupervised vs supervised adversarial attacks for classification. We report the robust accuracy when varying the perturbation radius $\epsilon$ (scaled to [0, 255]) for three classifiers on Imagenette: TiTok with ViT probing (left), FlexTok with linear probing (middle), and zero-shot UniTok (right). In most cases, our unsupervised attacks (blue curves), which target only the image tokenizer and do not need label information, perform close to end-to-end supervised attacks (red), which target the entire classifier and use labels. For small $\epsilon$, our unsupervised attacks are slightly worse than the supervised ones. Both attacks are optimized with 100 iterations of APGD on 500 images.
  • Figure 2: Reconstruction of unsupervised attacks. For each tokenizer, we show the reconstruction (given by the corresponding de-tokenization models) of the clean images and of adversarial images computed by unsupervised attacks at $\epsilon=4/255, 8/255$ with 2500 steps of APGD. The perturbed inputs affect the reconstruction differently depending on the tokenizer: TiTok yields the most distorted decoded images, while FlexTok is the most robust, with subjects that remain clearly recognizable.
  • Figure 3: Unsupervised targeted attack on captioning. We evaluate UniTok-MLLM with the original tokenizer and our robust version trained on ImageNet ($\epsilon=8/255$). We use our unsupervised attacks ($\epsilon=4/255$, 2,000 iterations) to minimize the distance in embedding space between the features of the perturbed and target images. Under attack, the model with the original UniTok tokenizer generates a caption about the target image, while the model with the robust tokenizer does not.
  • Figure 4: Supervised targeted attack on captioning. We evaluate UniTok-MLLM using the original UniTok tokenizer and our robust version trained on ImageNet ($\epsilon=8/255$). We attack with APGD-CE ($\epsilon=4/255$, 2,000 iterations) for a given target caption. Under attack, the model with the original UniTok tokenizer generates the target caption, while the model with the robust tokenizer does not.
  • Figure 5: Targeted attacks on classification for UniTok. We qualitatively evaluate targeted attacks using our unsupervised embedding-space attack and supervised APGD-CE, both run for 100 steps with $\epsilon=8/255$. Our unsupervised attack changes the label of the adversarial image as well as that of its reconstruction, whereas the supervised attack does not change the label of the adversarial image's reconstruction.
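
The targeted unsupervised attack mentioned in the captions of Figures 3 and 5 minimizes the distance in embedding space between the features of the perturbed image and those of a target image. A minimal sketch of that objective follows, under loudly stated assumptions: the encoder is a toy linear map `W` standing in for a real tokenizer, the optimizer is plain signed-gradient PGD with best-iterate tracking rather than APGD, and the names `features`, `sq_dist`, and `targeted_feature_attack` are hypothetical.

```python
def features(W, x):
    """Toy linear encoder (assumption) standing in for pre-quantization features."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sq_dist(u, v):
    """Squared L2 distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def targeted_feature_attack(W, source, target, eps, steps=100):
    """Plain PGD pulling features(source + delta) toward features(target)."""
    z_target = features(W, target)
    step = eps / 10
    adv = list(source)
    best, best_dist = list(adv), sq_dist(features(W, adv), z_target)
    for _ in range(steps):
        residual = [a - t for a, t in zip(features(W, adv), z_target)]
        # d/dx ||W x - z_target||^2 = 2 W^T (W x - z_target)
        grad = [2 * sum(row[j] * r for row, r in zip(W, residual))
                for j in range(len(adv))]
        for j, g in enumerate(grad):
            x = adv[j] - step * ((g > 0) - (g < 0))            # signed descent step
            x = min(source[j] + eps, max(source[j] - eps, x))  # stay in the eps-ball
            adv[j] = min(1.0, max(0.0, x))                     # stay a valid image
        dist = sq_dist(features(W, adv), z_target)
        if dist < best_dist:  # keep the best iterate, since fixed steps can oscillate
            best, best_dist = list(adv), dist
    return best, best_dist
```

Calling `targeted_feature_attack(W, source, target, eps=4 / 255)` returns the perturbed image and its remaining feature distance to the target. Pixels are assumed to lie in [0, 1], so a radius reported on the [0, 255] scale is divided by 255 before use. With a real tokenizer the gradient would come from automatic differentiation instead of the closed form used here.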