Indiscriminate Data Poisoning Attacks on Pre-trained Feature Extractors

Yiwei Lu; Matthew Y. R. Yang; Gautam Kamath; Yaoliang Yu

TL;DR

This work studies indiscriminate data poisoning against downstream tasks that use a fixed, self-supervised pre-trained feature extractor $f$ with a linear head $h$. It develops and compares input-space attacks (TGDA, GC, UE) and a novel feature-targeted (FT) pipeline that first obtains target linear-head parameters $\hat{\boldsymbol{\omega}}$ via GradPC, then poisons the feature space with GC to obtain poisoned features $\zeta$, and finally inverts $\zeta$ back to the input space via decoder inversion or feature matching. Empirically, transfer learning is more vulnerable than fine-tuning under these attacks, and FT attacks consistently outperform input-space attacks; however, inverting poisoned features back to clean-looking inputs remains challenging. Unlearnable-example (UE) attacks such as EMN are less effective when the feature extractor is fixed, indicating some robustness in this setting. Overall, the results highlight new security risks in SSL-based pipelines and point to directions for defenses and robust architecture design.

Abstract

Machine learning models have achieved great success in supervised learning with end-to-end training, which requires large amounts of labeled data that are not always available. Recently, many practitioners have shifted to self-supervised learning methods that use cheap unlabeled data to learn a general feature extractor via pre-training; the extractor can then be applied to personalized downstream tasks by simply training an additional linear layer on limited labeled data. However, this process also raises concerns about data poisoning attacks. For instance, indiscriminate data poisoning attacks, which aim to decrease model utility by injecting a small amount of poisoned data into the training set, pose a security risk to machine learning models but have only been studied for end-to-end supervised learning. In this paper, we extend the study of indiscriminate attacks to downstream tasks that apply pre-trained feature extractors. Specifically, we propose two types of attacks: (1) input space attacks, where we modify existing attacks to craft poisoned data directly in the input space; and, because optimization under constraints is difficult, (2) feature targeted attacks, which mitigate this challenge in three stages: first, acquiring target parameters for the linear head; second, finding poisoned features by treating the learned feature representations as a dataset; and third, inverting the poisoned features back to the input space. Our experiments examine these attacks in two popular downstream settings: fine-tuning on the same dataset, and transfer learning, which involves domain adaptation. Empirical results reveal that transfer learning is more vulnerable to our attacks. Additionally, input space attacks are a strong threat if no countermeasures are in place, but are otherwise weaker than feature targeted attacks.
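The second stage described above (finding poisoned features for a given target linear head) can be illustrated with a small NumPy sketch. It reads GC as gradient canceling: choose poisoned features so that the combined clean-plus-poison training gradient vanishes at the target parameters. Everything concrete here is an assumption made for illustration and not taken from the paper: the binary logistic head, the synthetic features, the choice of `w_target` as the negated clean direction (a stand-in for the output of GradPC), the 3% poison budget, and the backtracking step-size rule.

```python
import numpy as np

def sigmoid(t):
    # Clipping keeps np.exp from overflowing for large-magnitude logits.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -50.0, 50.0)))

def head_gradient(feats, labels, w):
    """Gradient of the summed logistic loss of linear head w on (feats, labels)."""
    return feats.T @ (sigmoid(feats @ w) - labels)

def gc_objective(poison, poison_labels, clean_grad, w_target):
    """Half the squared norm of the clean-plus-poison gradient at w_target."""
    total = clean_grad + head_gradient(poison, poison_labels, w_target)
    return 0.5 * float(total @ total)

def gc_gradient(poison, poison_labels, clean_grad, w_target):
    """Analytic gradient of gc_objective with respect to the poisoned features."""
    s = sigmoid(poison @ w_target)
    total = clean_grad + poison.T @ (s - poison_labels)
    direct = np.outer(s - poison_labels, total)
    through_sigmoid = np.outer((poison @ total) * s * (1.0 - s), w_target)
    return direct + through_sigmoid

def gradient_canceling(clean, clean_labels, w_target, poison, poison_labels, steps=200):
    """Gradient descent on the poisoned features with a backtracking step size."""
    clean_grad = head_gradient(clean, clean_labels, w_target)
    best = gc_objective(poison, poison_labels, clean_grad, w_target)
    for _ in range(steps):
        grad = gc_gradient(poison, poison_labels, clean_grad, w_target)
        step, improved = 1.0, False
        while step > 1e-12 and not improved:
            candidate = poison - step * grad
            value = gc_objective(candidate, poison_labels, clean_grad, w_target)
            if value < best:
                poison, best, improved = candidate, value, True
            step *= 0.5
        if not improved:  # no step size lowered the objective; stop early
            break
    return poison, best

# Toy stand-in for features produced by a frozen extractor (synthetic data).
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 5))
w_clean = rng.normal(size=5)
clean_labels = (clean @ w_clean > 0).astype(float)

w_target = -w_clean              # hypothetical target head, not GradPC output
poison_init = clean[:6].copy()   # initialize 6 poisons (3% of 200) from clean features
poison_labels = clean_labels[:6]

init_obj = gc_objective(poison_init, poison_labels,
                        head_gradient(clean, clean_labels, w_target), w_target)
poisoned, final_obj = gradient_canceling(clean, clean_labels, w_target,
                                         poison_init, poison_labels)
print(f"gradient-canceling objective: {init_obj:.3f} -> {final_obj:.3f}")
```

The sketch only shows the shape of the optimization. Whether a small poison budget can fully cancel the clean gradient depends on the target parameters, which is presumably why the target is chosen in a separate first stage.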

Paper Structure (30 sections, 14 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Background
Data Poisoning attacks
Contrastive Learning
Input Space Attacks
Problem Setting
TGDA input space attack
GC input space attack
UE (EMN) input space attack
Feature Targeted Attack
GC feature space attack
Inverting features
Decoder Inversion
Feature Matching
Experiments
...and 15 more sections

Figures (10)

Figure 1: An illustration of our threat model: (top row) we acquire the weights of a feature extractor $f$ using contrastive learning, optimizing the InfoNCE loss; (bottom row) we inject poisoned samples into the training dataset of a downstream application (image classification in this example) to perturb the linear head only. We examine two scenarios in this paper: (1) fine-tuning, where pre-training and downstream tasks share the same training set; and (2) transfer learning, where the downstream task is performed on a different dataset.
Figure 2: Clean training samples from CIFAR-10, which serve as initialization for the attacks (first row), and poisoned samples generated by the GC input space attack for $\epsilon_d=0.03$, which induces an accuracy drop of 29.54% (second row). The poisoned images show that the GC input space attack generates images with no semantic meaning if no explicit constraints are imposed. Clean images and their corresponding poisoned ones are chosen randomly.
Figure 3: An illustration of the three stages of feature-targeted attacks: (1) obtaining the target linear head parameter $\boldsymbol{\omega}$ with GradPC; (2) acquiring poisoned features $\zeta$ with the GC feature-space attack; (3) inverting $\zeta$ back to the input space using feature matching or decoder inversion (decoder inversion requires training an autoencoder with a fixed encoder $f$).
Figure 4: Original test images (first row), and images reconstructed by an autoencoder with a fixed ResNet-18 feature extractor learned by MoCo (second row), by the same autoencoder trained end-to-end (third row), and by an autoencoder with skip connections, i.e., a U-Net (fourth row).
Figure 5: Clean images (first row), and poisoned samples returned by the feature matching algorithm with $\beta=0.25, 0.1, 0.05$ in the second, third, and fourth rows, respectively.
...and 5 more figures
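
The feature matching step referenced in Figures 3 and 5 can be sketched as follows. This is a minimal illustration and not the paper's algorithm: it assumes the objective has the form $\|f(x)-\zeta\|^2 + \beta\|x-x_0\|^2$, with $\beta$ weighting closeness to the clean input $x_0$, and it replaces the pre-trained extractor $f$ with a tiny fixed `tanh` layer so the gradient can be written by hand. The names `W`, `extractor`, `fm_loss`, `fm_grad`, and `feature_matching` are hypothetical, and the target `zeta` is a random perturbation of the clean feature, not a real poisoned feature.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) / 4.0  # weights of a toy frozen extractor (hypothetical)

def extractor(x):
    """Toy stand-in for the fixed feature extractor f."""
    return np.tanh(W @ x)

def fm_loss(x, x_clean, zeta, beta):
    """Assumed objective: feature mismatch plus beta times input distortion."""
    feat_term = np.sum((extractor(x) - zeta) ** 2)
    input_term = np.sum((x - x_clean) ** 2)
    return float(feat_term + beta * input_term)

def fm_grad(x, x_clean, zeta, beta):
    """Analytic gradient of fm_loss with respect to the input x."""
    f = extractor(x)
    return 2.0 * W.T @ ((f - zeta) * (1.0 - f ** 2)) + 2.0 * beta * (x - x_clean)

def feature_matching(x_clean, zeta, beta, steps=500, lr=0.02):
    """Plain gradient descent on the input, starting from the clean sample."""
    x = x_clean.copy()
    for _ in range(steps):
        x -= lr * fm_grad(x, x_clean, zeta, beta)
    return x

x_clean = rng.normal(size=16)
# Illustrative target feature: the clean feature plus noise, kept inside tanh's range.
zeta = np.clip(extractor(x_clean) + 0.3 * rng.normal(size=8), -0.95, 0.95)

init_feat_dist = float(np.linalg.norm(extractor(x_clean) - zeta))
results = {}
for beta in (0.25, 0.1, 0.05):  # the beta values quoted in the Figure 5 caption
    x_poison = feature_matching(x_clean, zeta, beta)
    results[beta] = (
        float(np.linalg.norm(extractor(x_poison) - zeta)),  # feature-space distance
        float(np.linalg.norm(x_poison - x_clean)),          # input-space distance
    )
    print(beta, results[beta])
```

Under this assumed objective, $\beta$ sets the trade-off between matching the poisoned feature and staying close to the clean input, which is one plausible reading of why Figure 5 shows results for several values of $\beta$.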
