Defending Against Neural Network Model Inversion Attacks via Data Poisoning

Shuai Zhou, Dayong Ye, Tianqing Zhu, Wanlei Zhou

TL;DR

The paper tackles privacy leakage from model inversion attacks in MLaaS by proposing a retraining-free defense based on data poisoning. It introduces two post-processing defenses that contaminate the inversion-model training process: LPA, which perturbs all confidence vectors in a label-preserving manner, and LFP, which selectively poisons vectors with occasional label flips. Using gradient alignment and carefully selected poisoned data, LPA achieves a favorable privacy-utility trade-off, outperforming state-of-the-art retraining-based defenses such as MID while preserving classifier accuracy. The approach is shown to be robust across architectures and loss functions, and even against adaptive adversaries, and it can be accelerated with a VAE-based perturbation generator for practical deployment.

Abstract

Model inversion attacks pose a significant privacy threat to machine learning models by reconstructing sensitive data from their outputs. While various defenses have been proposed to counteract these attacks, they often come at the cost of the classifier's utility, thus creating a challenging trade-off between privacy protection and model utility. Moreover, most existing defenses require retraining the classifier for enhanced robustness, which is impractical for large-scale, well-established models. This paper introduces a novel defense mechanism to better balance privacy and utility, particularly against adversaries who employ a machine learning model (i.e., inversion model) to reconstruct private data. Drawing inspiration from data poisoning attacks, which can compromise the performance of machine learning models, we propose a strategy that leverages data poisoning to contaminate the training data of inversion models, thereby preventing model inversion attacks. Two defense methods are presented. The first, termed label-preserving poisoning attacks for all output vectors (LPA), involves subtle perturbations to all output vectors while preserving their labels. Our findings demonstrate that these minor perturbations, introduced through a data poisoning approach, significantly increase the difficulty of data reconstruction without compromising the utility of the classifier. Subsequently, we introduce a second method, label-flipping poisoning for partial output vectors (LFP), which selectively perturbs a small subset of output vectors and alters their labels during the process. Empirical results indicate that LPA is notably effective, outperforming the current state-of-the-art defenses. Our data poisoning-based defense provides a new retraining-free defense paradigm that preserves the victim classifier's utility.
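
The label-preserving constraint described for LPA can be illustrated with a minimal sketch. It is not the paper's procedure for computing perturbations; it only shows one assumed way to apply a given perturbation to a confidence vector so that the result stays a valid probability vector with the same top-1 label. The function name `label_preserving_perturb`, the halving schedule, and the `eps` floor are hypothetical choices made for illustration.

```python
def label_preserving_perturb(conf, delta, eps=1e-6, max_halvings=20):
    """Apply `delta` to a confidence vector without changing its top-1 label.

    Assumption (not from the paper): if the perturbed vector would flip the
    label, the perturbation is halved until the label is preserved.
    """
    label = max(range(len(conf)), key=conf.__getitem__)
    scale = 1.0
    for _ in range(max_halvings):
        # Perturb, keep every entry strictly positive, then renormalize.
        raw = [max(c + scale * d, eps) for c, d in zip(conf, delta)]
        total = sum(raw)
        out = [r / total for r in raw]
        if max(range(len(out)), key=out.__getitem__) == label:
            return out
        scale *= 0.5
    # Fall back to the unperturbed vector if no scale preserves the label.
    return list(conf)


# Usage: the full perturbation would flip the label, so it is shrunk first.
poisoned = label_preserving_perturb([0.7, 0.2, 0.1], [-0.3, 0.25, 0.05])
```

Because only the released confidence vector changes and its arg-max is kept, the classifier's predicted label, and hence its accuracy, is unaffected in this sketch.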

Paper Structure

This paper contains 40 sections, 1 theorem, 23 equations, 10 figures, 10 tables, 3 algorithms.

Key Result

Proposition 1

Let $\mathcal{L}_{def}(\mathcal{I})$ have a Lipschitz continuous gradient with constant $L>0$, and let $\alpha$ be the learning rate used to train the inversion model. We assume that the reconstruction loss on the target data of a randomly initialized inversion model (i.e., without training) follows normal di for constant $c<1$ and any iteration $k$, then where Further, if we define the reconstruction err

Figures (10)

  • Figure 1: The pipeline of a model inversion attack. The test samples are fed into the classifier, and the classifier provides the confidence vectors for them. These predictions can be used by attackers to train an inversion model to reconstruct private images. Different defenses act at different phases. Regularization-based defenses tend to modify the training procedure of the classifier, while fine-tuning predictions and our methods manipulate the confidence vectors generated by the classifier. Adversarial examples are used to fool the inversion model in the reconstruction phase after it is trained.
  • Figure 2: Overview of our unified framework for LPA and LFP. In the Post-process Unit (PU, steps $1$-$3$), the attacker's attack data are transformed into corrupted training data using the initial perturbation $\delta$. The corrupted data are fed to the inversion model, the training loss is computed on the perturbed data, and the poisoned gradient is derived from it (step $4$). Likewise, the target gradient is obtained by computing the defensive loss on private data (step $5$). To align the two gradients, the gradient alignment loss is computed, and its gradient w.r.t. the perturbation is used to update the perturbation by gradient descent (GD). The perturbation is refined iteratively by repeating steps $1$-$7$. The PU is drawn in a unified form that covers both methods: the PU of LFP contains all three steps (steps $1$-$3$), whereas the PU of LPA contains only step $3$, with $D_c = \emptyset$.
  • Figure 3: Visualization of the similarity between poisoned and normal confidence vectors based on t-SNE and PCA.
  • Figure 4: The reconstructed images generated by inversion models under different defense strategies.
  • Figure 5: The reconstructed images generated by inversion models with different architectures. 'w/' indicates 'with LPA defense' while 'w/o' signifies 'without LPA defense'.
  • ...and 5 more figures
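
The perturbation update loop described in the Figure 2 caption can be sketched on a toy problem. Everything below is an assumption made for illustration rather than the paper's implementation: the inversion model is a linear map `W`, the defensive loss is taken to be the negative reconstruction error on private data, the alignment loss is one minus cosine similarity, and the gradient w.r.t. the perturbation is estimated with central finite differences instead of automatic differentiation. Label and simplex constraints on the perturbed vectors are omitted.

```python
import math


def recon_grad(W, vectors, images):
    """Gradient of the mean squared reconstruction error w.r.t. W (flattened)."""
    D, C = len(W), len(W[0])
    g = [0.0] * (D * C)
    for v, x in zip(vectors, images):
        for i in range(D):
            err = sum(W[i][j] * v[j] for j in range(C)) - x[i]
            for j in range(C):
                g[i * C + j] += 2.0 * err * v[j] / len(vectors)
    return g


def alignment_loss(delta, W, attack_v, attack_x, private_v, private_x):
    """1 - cosine similarity between the poisoned and target gradients."""
    poisoned_v = [[a + d for a, d in zip(v, dv)] for v, dv in zip(attack_v, delta)]
    g_poison = recon_grad(W, poisoned_v, attack_x)
    # Assumed defensive loss: negative reconstruction error on private data,
    # so the target gradient is the negated reconstruction gradient.
    g_target = [-g for g in recon_grad(W, private_v, private_x)]
    dot = sum(p * t for p, t in zip(g_poison, g_target))
    norm = math.sqrt(sum(p * p for p in g_poison)) * math.sqrt(sum(t * t for t in g_target))
    return 1.0 - dot / (norm + 1e-12)


def update_perturbation(delta, lr, h, *args):
    """One gradient-descent step on delta using central finite differences."""
    probe = [row[:] for row in delta]
    new = [row[:] for row in delta]
    for n in range(len(delta)):
        for j in range(len(delta[n])):
            probe[n][j] = delta[n][j] + h
            up = alignment_loss(probe, *args)
            probe[n][j] = delta[n][j] - h
            down = alignment_loss(probe, *args)
            probe[n][j] = delta[n][j]
            new[n][j] = delta[n][j] - lr * (up - down) / (2 * h)
    return new


# Toy data (hypothetical values): 3 classes, 2-pixel "images".
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
attack_v = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
attack_x = [[1.0, 0.0], [0.0, 1.0]]
private_v = [[0.6, 0.3, 0.1]]
private_x = [[0.9, 0.1]]
args = (W, attack_v, attack_x, private_v, private_x)

delta = [[0.0] * 3 for _ in attack_v]
before = alignment_loss(delta, *args)
for _ in range(10):
    delta = update_perturbation(delta, 0.05, 1e-5, *args)
after = alignment_loss(delta, *args)
```

In a realistic setting the inversion model would be a neural network and the gradient of the alignment loss would come from an autodiff framework; the finite-difference step is used here only to keep the sketch dependency-free.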

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Proposition 1