Memory Backdoor Attacks on Neural Networks

Eden Luzon, Guy Amit, Roy Weiss, Torsten Krauß, Alexandra Dmitrienko, Yisroel Mirsky

TL;DR

Federated learning can leak private client data through model updates when a malicious server tampers with the training code. The authors introduce a memory backdoor that makes a model memorize selected training samples and return them deterministically when queried with an iterable, pattern-based index trigger, achieving high capacity with minimal impact on primary task performance. They validate the approach on vision tasks (classification and segmentation) and large language models, demonstrating extraction of hundreds to thousands of authentic samples, and even entire datasets, under realistic constraints. The work highlights a serious supply-chain privacy risk in distributed training and motivates stronger integrity checks and transparent auditing of training code, with open-source artifacts to support reproducibility.

Abstract

Neural networks are often trained on proprietary datasets, making them attractive attack targets. We present a novel dataset extraction method leveraging an innovative training-time backdoor attack, allowing a malicious federated learning server to systematically and deterministically extract complete client training samples through a simple indexing process. Unlike prior techniques, our approach guarantees exact data recovery rather than probabilistic reconstructions or hallucinations, provides precise control over which samples are memorized and how many, and shows high capacity and robustness. Infected models output data samples when they receive a pattern-based index trigger, enabling systematic extraction of meaningful patches from each client's local data without disrupting global model utility. To address small model output sizes, we extract patches and then recombine them. The attack requires only a minor modification to the training code that can easily evade detection during client-side verification. Hence, this vulnerability represents a realistic FL supply-chain threat, where a malicious server can distribute modified training code to clients and later recover private data from their updates. Evaluations across classifiers, segmentation models, and large language models demonstrate that thousands of sensitive training samples can be recovered from client models with minimal impact on task performance, and a client's entire dataset can be stolen after multiple FL rounds. For instance, a medical segmentation dataset can be extracted with only a 3 percent utility drop. These findings expose a critical privacy vulnerability in FL systems, emphasizing the need for stronger integrity and transparency in distributed training pipelines.
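The patch step mentioned in the abstract (extract patches, then recombine them) can be illustrated with a minimal sketch. It assumes a single-channel image stored as nested lists, square patches, and row-major patch ordering; the function names `split_into_patches` and `recombine_patches` are hypothetical and are not taken from the paper's code.

```python
from typing import List

Image = List[List[int]]  # one channel, stored as rows of pixel values


def split_into_patches(image: Image, patch: int) -> List[Image]:
    """Cut an image into patch x patch blocks in row-major order (assumed layout)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


def recombine_patches(patches: List[Image], height: int, width: int) -> Image:
    """Place row-major patches back onto a height x width canvas."""
    patch = len(patches[0])
    per_row = width // patch  # number of patches in one row of the image
    image = [[0] * width for _ in range(height)]
    for index, block in enumerate(patches):
        top, left = (index // per_row) * patch, (index % per_row) * patch
        for r, row in enumerate(block):
            image[top + r][left:left + patch] = row
    return image
```

Splitting and recombining are exact inverses under these assumptions, which is what allows a sample to be recovered in full even when the model can only emit one small patch per query.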

Paper Structure

This paper contains 33 sections, 6 equations, 15 figures, 6 tables, 1 algorithm.

Figures (15)

  • Figure 1: Single-round illustration of a memory backdoor attack: The server first distributes modified FL code to clients. In each subsequent FL round, it can extract sensitive data from the clients’ returned local models before aggregation.
  • Figure 3: Visualization of how pattern triggers can be used to reconstruct an image one patch at a time.
  • Figure 4: An example of a pattern-based index trigger for $(t_k, t_i, t_l, t_c) = (33, 110, 4, 0)$: The red channel ($t_c$) of patch 4 ($t_l$) from the 110th image ($t_i$) of class 33 ($t_k$). The trigger is for CIFAR-100: images of 3x32x32 with 100 classes. The final trigger is in color, as channels 0–2 correspond to the red, green, and blue image channels.
  • Figure 5: Overview of the memory backdoor attack on an image classifier: (1) The model is backdoored during training using untrusted or tampered code, (2) deployed with a black-box query interface, (3) the attacker extracts memorized patches using the index, and (4) reassembles images accordingly.
  • Figure 6: The global model's accuracy in FL across training rounds with and without a memory backdoor attack for CIFAR100-ViT (left) and MNIST-FCN (right). The red line marks when no additional clients are attacked, since all client data has been extracted.
  • ...and 10 more figures
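
The index tuple $(t_k, t_i, t_l, t_c)$ from the Figure 4 caption suggests how extraction can be made iterable: an attacker walks through every patch location and channel of a given image. The sketch below only enumerates those tuples. It does not show how a tuple is rendered into the visual trigger pattern, which is not described in this section. The name `TriggerIndex`, the function `iter_image_indices`, and the value of 16 patches per image are assumptions made for illustration.

```python
from itertools import product
from typing import Iterator, NamedTuple


class TriggerIndex(NamedTuple):
    t_k: int  # class label
    t_i: int  # image index within the class
    t_l: int  # patch location within the image
    t_c: int  # colour channel (0 = red, 1 = green, 2 = blue)


def iter_image_indices(
    t_k: int, t_i: int, num_patches: int, num_channels: int = 3
) -> Iterator[TriggerIndex]:
    """Yield every (patch, channel) index needed to rebuild one image."""
    for t_l, t_c in product(range(num_patches), range(num_channels)):
        yield TriggerIndex(t_k, t_i, t_l, t_c)


# Assumed example: 16 patches per image, three colour channels.
indices = list(iter_image_indices(t_k=33, t_i=110, num_patches=16))
```

Under these assumptions, one image needs 16 × 3 = 48 queries, and the tuple from the caption, `TriggerIndex(33, 110, 4, 0)`, is one of them.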