Backdoor Attack with Mode Mixture Latent Modification

Hongwei Zhang, Xiaoyin Xu, Dongsheng An, Xianfeng Gu, Min Zhang

TL;DR

This paper proposes a backdoor attack paradigm that only requires minimal alterations to a clean model in order to inject the backdoor under the guise of fine-tuning, and introduces a novel method for conducting backdoor attacks.

Abstract

Backdoor attacks have become a significant security concern for deep neural networks in recent years. An image classification model can be compromised if malicious backdoors are injected into it. A compromised model functions normally on clean images but predicts a specific target label when triggers are present. Previous research falls into two categories: poisoning a portion of the dataset with triggered images so that users train a backdoored model from scratch, or training a backdoored model alongside a triggered image generator. Both approaches require a significant number of attackable parameters to be optimized in order to establish a connection between the trigger and the target label, which may raise suspicion as more people become aware of the existence of backdoor attacks. In this paper, we propose a backdoor attack paradigm that requires only minimal alterations (specifically, to the output layer) to a clean model in order to inject the backdoor under the guise of fine-tuning. To achieve this, we leverage mode mixture samples, which are located between different modes in latent space, and introduce a novel method for conducting backdoor attacks. We evaluate the effectiveness of our method on four popular benchmark datasets: MNIST, CIFAR-10, GTSRB, and TinyImageNet.
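To make the "output layer only" constraint concrete, the following is a minimal sketch of fine-tuning in which every parameter except the final linear layer is frozen. It is not the authors' method: the construction of mode mixture samples is not shown, and the backbone (`W_backbone`), the toy data, the learning rate, and all function names are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 3

# Hypothetical frozen "backbone": a fixed random projection followed by ReLU.
# It stands in for all layers of a clean model below the output layer.
W_backbone = rng.normal(size=(8, 16)) / np.sqrt(8)

def features(x):
    """Frozen feature extractor; W_backbone is read but never updated."""
    return np.maximum(x @ W_backbone, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W_out, b_out, x, y):
    p = softmax(features(x) @ W_out + b_out)
    return float(-np.log(p[np.arange(len(y)), y]).mean())

def finetune_output_layer(W_out, b_out, x, y, lr=0.1, steps=200):
    """Gradient descent on the output layer's weights and bias only."""
    h = features(x)                      # computed once: the backbone is frozen
    onehot = np.eye(W_out.shape[1])[y]
    for _ in range(steps):
        p = softmax(h @ W_out + b_out)
        grad_logits = (p - onehot) / len(x)   # gradient of mean cross-entropy
        W_out = W_out - lr * (h.T @ grad_logits)
        b_out = b_out - lr * grad_logits.sum(axis=0)
    return W_out, b_out

# Toy stand-in for a fine-tuning set; the labels carry no meaning.
x = rng.normal(size=(60, 8))
y = rng.integers(0, NUM_CLASSES, size=60)

W_out = np.zeros((16, NUM_CLASSES))
b_out = np.zeros(NUM_CLASSES)
backbone_before = W_backbone.copy()

loss_before = cross_entropy(W_out, b_out, x, y)
W_out, b_out = finetune_output_layer(W_out, b_out, x, y)
loss_after = cross_entropy(W_out, b_out, x, y)
```

Because only `W_out` and `b_out` are reassigned, the altered model differs from the clean one in a single layer, which is the sense in which the alteration is minimal.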

Paper Structure (24 sections, 9 equations, 14 figures, 8 tables)

This paper contains 24 sections, 9 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: A typical data-poisoning attack proceeds as follows. (a) Backdoor images are formed by adding triggers to clean images. (b) These manipulated images are subsequently integrated into the dataset. (c) Users may unintentionally download the compromised dataset from the internet for model training, without being aware of the concealed backdoor images. (d) Upon encountering the triggered data, the infiltrated model behaves maliciously, despite exhibiting normal behaviour when processing benign data. Thus, the stealthy backdoor attack is successful.
  • Figure 2: A typical training-controllable attack proceeds as follows. (a) The attacker creates poisoned images using a generator. (b) The combined set of generated poisoned images and clean images is then used for model training. Both the generator and the model are optimized with a malicious objective. (c) The manipulated model is released to the user. The model operates as expected when tested with clean images. However, when the attacker uses the generator to create additional poisoned images, the model predicts a pre-set target label. Thus, the stealthy backdoor attack is successful.
  • Figure 3: Locating mode mixture samples. An encoder is adopted to map image space X to latent space Z. Assume that in latent space Z, all the latent codes are clustered into three modes, symbolized by triangles, squares, and cubes. The extended optimal transport maps noise space Y to latent space Z. The singular set between different modes is plotted with dashed lines. When points of the singular set are mapped to latent space Z, they result in mode mixture samples positioned between different modes within Z, indicated by crosses.
  • Figure 4: Visualization of poisoned images from different methods. From left to right: original clean image, then poisoned images produced by BadNets [gu2017badnets], Blending [chen2017targeted], ReFool [liu2020reflection], WaNet [nguyen2021wanet], and the proposed method. The residual is amplified by $4\times$.
  • Figure 5: Visualization of poisoned images from our method. First row: original image samples. Second row: poisoned images. Third row: residual. Fourth row: normalized residual.
  • ...and 9 more figures
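
Steps (a) and (b) of the data-poisoning attack in Figure 1 can be sketched as follows. This is a generic patch-trigger illustration in the spirit of BadNets, not the method proposed in this paper; the patch size and position, the poisoning rate, the target label, and the helper names `add_trigger` and `poison_dataset` are all assumptions made for the example.

```python
import numpy as np

def add_trigger(image, size=3, value=1.0):
    """Return a copy of `image` with a square patch in the bottom-right corner."""
    poisoned = image.copy()
    poisoned[-size:, -size:] = value
    return poisoned

def poison_dataset(images, labels, target_label, rate, rng):
    """Stamp the trigger on a random fraction `rate` of the images and
    relabel those images to `target_label`. The inputs are left unmodified."""
    images = images.copy()
    labels = labels.copy()
    n_poison = int(round(rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = add_trigger(images[i])
        labels[i] = target_label
    return images, labels, idx

rng = np.random.default_rng(0)

# Toy stand-in for a clean dataset: 100 grayscale 28x28 images, 10 classes.
# Pixel values stay below 0.5 so the bright trigger patch is unambiguous.
clean_images = rng.uniform(0.0, 0.5, size=(100, 28, 28))
clean_labels = rng.integers(0, 10, size=100)

poisoned_images, poisoned_labels, poisoned_idx = poison_dataset(
    clean_images, clean_labels, target_label=7, rate=0.1, rng=rng
)
```

A model trained from scratch on `poisoned_images` and `poisoned_labels` would see the patch paired with the target label in 10% of its training samples, which is how the association in step (d) is learned in this family of attacks.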