Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

Shanglun Feng; Florian Tramèr

TL;DR

The paper exposes a novel supply-chain risk where tampering with pretrained weights enables privacy backdoors that can extract finetuning data from downstream tasks, even under differential privacy. It introduces data-trap backdoors that latch after capturing a single training example and then extinguish, enabling high-probability reconstruction with minimal impact on model utility. Extending the idea to transformers (ViT and BERT), the work designs a modular backdoor architecture with input, backdoor, amplifier, erasure, propagation, and output modules, and it demonstrates attacks under white-box and black-box access, including perfect membership inference and black-box data reconstruction via model extraction. The results imply that DP-SGD privacy guarantees can be nearly tight for end-to-end attackers if the model is backdoored, challenging the practice of loose privacy budgets and highlighting the urgent need for stricter protections in the ML supply chain. Overall, the work broadens the threat model for ML privacy, showing that untrusted pretrained models can leak or reconstruct private finetuning data and compel reevaluation of privacy protections in practice.
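The latch-then-extinguish behaviour of a data trap can be illustrated on a single unit $h = \texttt{ReLU}(\mathbf{w}^\top\mathbf{x} + b)$. The following is a minimal sketch under stated assumptions, not the authors' construction: it uses plain SGD with batch size 1, a fixed positive upstream gradient `upstream_grad` (in a real model this depends on the randomly initialized classification head), and hypothetical helper names `trap_sgd_step` and `reconstruct`.

```python
import numpy as np

def trap_sgd_step(w, b, x, upstream_grad, lr):
    """Apply one plain SGD step to a single unit h = ReLU(w @ x + b).

    upstream_grad stands in for dL/dh (assumed fixed and positive here).
    """
    active = (w @ x + b) > 0           # ReLU passes gradient only when active
    g = upstream_grad if active else 0.0
    return w - lr * g * x, b - lr * g  # dL/dw = g * x, dL/db = g

def reconstruct(w_before, b_before, w_after, b_after):
    """Recover the captured input from the weight change, if any."""
    db = b_after - b_before
    if db == 0.0:
        return None                    # the unit never fired, nothing captured
    return (w_after - w_before) / db   # (-lr*g*x) / (-lr*g) = x

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, 8), rng.uniform(0, 1, 8)  # toy inputs in [0, 1]
w0, b0 = np.full(8, 0.1), 0.0                        # illustrative trap weights

# The first input activates the unit and is written into the weights.
w1, b1 = trap_sgd_step(w0, b0, x1, upstream_grad=1.0, lr=1.0)
# The update drives the bias negative, so the second input leaves no trace.
w2, b2 = trap_sgd_step(w1, b1, x2, upstream_grad=1.0, lr=1.0)

recovered = reconstruct(w0, b0, w2, b2)
print(np.allclose(recovered, x1))
```

In this toy setting the unit stays shut after the first step because the bias drops to -1 while the remaining pre-activation for any input in $[0, 1]^8$ is at most 0.8. Real finetuning involves mini-batches, other optimizers, and gradients of varying sign, which this sketch does not model.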

Abstract

Practitioners commonly download pretrained machine learning models from open repositories and finetune them to fit specific applications. We show that this practice introduces a new risk of privacy backdoors. By tampering with a pretrained model's weights, an attacker can fully compromise the privacy of the finetuning data. We show how to build privacy backdoors for a variety of models, including transformers, which enable an attacker to reconstruct individual finetuning samples with guaranteed success. We further show that backdoored models allow for tight privacy attacks on models trained with differential privacy (DP). The common optimistic practice of training DP models with loose privacy guarantees is thus insecure if the model is not trusted. Overall, our work highlights a crucial and overlooked supply chain attack on machine learning privacy.
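To make "tight" concrete: any $(\varepsilon, \delta)$-DP mechanism constrains a membership-inference attack so that $\mathrm{TPR} \le e^{\varepsilon}\,\mathrm{FPR} + \delta$ and $\mathrm{TNR} \le e^{\varepsilon}\,\mathrm{FNR} + \delta$. Observed attack rates therefore yield a lower bound on $\varepsilon$, and an attack is tight when that bound approaches the provable budget. The sketch below applies this standard hypothesis-testing bound to point estimates; the function name is hypothetical, and it is not necessarily the estimation procedure used in the paper. A careful audit would replace the point estimates with confidence intervals.

```python
import math

def empirical_epsilon_lower_bound(tpr, fpr, delta=0.0):
    """Lower-bound epsilon from a membership attack's TPR and FPR.

    Uses the (eps, delta)-DP constraints
        TPR <= exp(eps) * FPR + delta
        TNR <= exp(eps) * FNR + delta
    on point estimates (no confidence intervals).
    """
    fnr, tnr = 1.0 - tpr, 1.0 - fpr
    bounds = [0.0]
    # Each pair is (numerator, denominator) of a ratio that exp(eps) must exceed.
    for num, den in ((tpr - delta, fpr), (tnr - delta, fnr)):
        if num <= 0:
            continue                 # this constraint is uninformative
        if den == 0:
            return math.inf          # no finite eps is consistent with these rates
        bounds.append(math.log(num / den))
    return max(bounds)

# Illustrative rates only, not results from the paper.
print(empirical_epsilon_lower_bound(tpr=0.9, fpr=0.1, delta=1e-5))
```

An attack with no errors at all (TPR of 1, FPR of 0) makes the point-estimate bound infinite, which is why finite-sample confidence intervals matter in practice.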

Paper Structure (89 sections, 58 equations, 15 figures, 8 tables)

Related Work
Threat Model
Warmup: White-box Data Stealing in MLPs
Attack Description
Multiple Backdoor Units
Experiments
Privacy Backdoors in Transformers
Transformers.
Challenges
Maintaining model utility.
Moving beyond linear layers.
Capturing all tokens in an input.
Building robust backdoors.
Overview of our Approach
Input module.
...and 74 more sections

Figures (15)

Figure 1: Illustration of the output of a data trap ($h = \texttt{ReLU}\left(\mathbf{w}^\top\mathbf{x} + b\right)$) connected to the model's output. The weights of the classification head (in red) are typically not under the control of the attacker, and randomly initialized before finetuning.
Figure 2: Reconstructed images from a backdoored MLP model. Top: Reconstruction; Middle: Ground truth, or a gray image if the backdoor failed to capture a unique input (see the appendix for details); Bottom: Ground truths for failed backdoors that captured a mix of two inputs (left corresponds to row 2, column 4; right corresponds to row 1, column 15).
Figure 3: The logical architecture of a backdoored transformer. ft: benign features inheriting the utility of the pretrained weights. key: the features for selectively activating a backdoor, and the data to be captured. act: activation signals, i.e., outputs from backdoor units. From left to right, our backdoor construction consists of: an input module that creates the logical feature separation $\mathbf{x} = [\mathbf{ft}, \mathbf{key}, \mathbf{0}]$; the main backdoor module that captures inputs in a linear layer; an amplifier module that increases backdoor activation signals; an erasure module that wipes out unneeded features; $n$ signal propagation modules that compute standard benign features $\mathbf{ft}$ while carrying the backdoor outputs $\mathbf{act}$; and an output module that averages all features onto the class token for finetuning.
Figure 4: Reconstructed complete images from a backdoored ViT-B/32 with GELU activations (finetuned on Caltech 101). Top: Reconstruction; Bottom: Ground truth, when it is unambiguous.
Figure 5: Empirical privacy for backdoored models for an end-to-end attacker, compared to DP-SGD's provable upper bound. The estimation in orange is a conservative lower bound of the empirical privacy budget, to account for the target gradients not fully concentrating on the backdoor weights.
...and 10 more figures

Theorems & Definitions (1)

Definition 5.1
