QuantAttack: Exploiting Dynamic Quantization to Attack Vision Transformers

Amit Baras, Alon Zolfi, Yuval Elovici, Asaf Shabtai

TL;DR

QuantAttack identifies a new availability threat in dynamic post-training quantization for vision transformers: adversarial perturbations crafted to increase the number of high-precision (16-bit) multiplications performed during inference. It uses a PGD-based optimization with a dual loss $\mathcal{L} = \mathcal{L}_{\text{quant}} + \lambda \mathcal{L}_{\text{cls}}$, where $\mathcal{L}_{\text{quant}}$ pushes the per-column top-$K$ values toward a high target $x_{\text{target}}$ and $\mathcal{L}_{\text{cls}}$ preserves the predicted class, so that the perturbed input triggers worst-case resource usage without changing the model's output. The attack is demonstrated on ViT and DeiT in three variants (single-image, universal, and class-universal) and across other architectures and tasks. It notably increases memory usage, inference time, and energy consumption, and transfers between models only to a limited degree; the analysis also offers insights into defense strategies. The work highlights the security implications of dynamic quantization in transformers and motivates the development of robust quantization schemes and defenses for resource-constrained deployments and real-time applications.
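The dual loss and the PGD update can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The squared-gap form of the quantization loss, the $L_\infty$ projection, the $[0, 1]$ pixel range, the threshold rule in `high_precision_fraction`, and all function and parameter names (`quant_loss`, `total_loss`, `pgd_step`, `threshold`, `alpha`, `eps`) are assumptions made for illustration. The gradient is taken as given, since computing it requires an automatic-differentiation framework and a real model.

```python
import numpy as np

def quant_loss(acts, k, x_target):
    """Mean squared gap between each column's top-k magnitudes and x_target.

    acts: (tokens, features) activations entering a quantized linear layer.
    Magnitudes already at or above x_target contribute nothing (assumed form).
    """
    mags = np.abs(acts)
    # The k largest magnitudes of every column; their order does not matter.
    top_k = np.partition(mags, -k, axis=0)[-k:, :]
    gap = np.clip(x_target - top_k, 0.0, None)
    return float(np.mean(gap ** 2))

def total_loss(l_quant, l_cls, lam):
    """Dual loss from the summary: L = L_quant + lambda * L_cls."""
    return l_quant + lam * l_cls

def high_precision_fraction(acts, threshold):
    """Fraction of columns with at least one magnitude above `threshold`.

    Assumes a threshold-based rule in which such columns are multiplied
    in 16-bit precision and the rest in low precision.
    """
    return float(np.mean(np.any(np.abs(acts) > threshold, axis=0)))

def pgd_step(x_adv, x_clean, grad, alpha, eps):
    """One signed-gradient descent step, projected onto an L-inf ball.

    grad is the gradient of the total loss with respect to x_adv.
    """
    x_new = x_adv - alpha * np.sign(grad)
    # Keep the perturbation within eps of the clean image (assumed norm).
    x_new = np.clip(x_new, x_clean - eps, x_clean + eps)
    # Keep pixel values in the assumed valid range.
    return np.clip(x_new, 0.0, 1.0)
```

In an attack loop, `quant_loss` would be evaluated on the inputs to each quantized layer, combined with a classification term through `total_loss`, and `pgd_step` would be applied repeatedly; `high_precision_fraction` then gives a rough measure of how many columns a perturbed input pushes onto the 16-bit path.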

Abstract

In recent years, there has been a significant trend in deep neural networks (DNNs), particularly transformer-based models, of developing ever-larger and more capable models. While they demonstrate state-of-the-art performance, their growing scale requires increased computational resources (e.g., GPUs with greater memory capacity). To address this problem, quantization techniques (i.e., low-bit-precision representation and matrix multiplication) have been proposed. Most quantization techniques employ a static strategy in which the model parameters are quantized, either during training or inference, without considering the test-time sample. In contrast, dynamic quantization techniques, which have become increasingly popular, adapt during inference based on the input provided, while maintaining full-precision performance. However, their dynamic behavior and average-case performance assumption make them vulnerable to a novel threat vector: adversarial attacks that target the model's efficiency and availability. In this paper, we present QuantAttack, a novel attack that targets the availability of quantized models, slowing down inference and increasing memory usage and energy consumption. We show that carefully crafted adversarial examples, which are designed to exhaust the resources of the operating system, can trigger worst-case performance. In our experiments, we demonstrate the effectiveness of our attack on vision transformers on a wide range of tasks, both uni-modal and multi-modal. We also examine the effect of different attack variants (e.g., a universal perturbation) and the transferability between different models.

Paper Structure (21 sections, 10 equations, 2 figures, 4 tables)

This paper contains 21 sections, 10 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Evaluating the influence of a single perturbed image in a batch for different batch sizes. The values represent the percentage difference between a benign batch and its attacked counterpart.
  • Figure 2: Illustrating the relation between the number of outliers and the normalization layer $\gamma$ parameter. (a) percentage of f16 matrix multiplication across all transformer blocks and linear layers (LL); and (b) the corresponding $\gamma$ values for each normalization layer in each transformer block. MSA LL 1-3 are merged for simplicity since there are no outlier values in these layers.