Seeing the Forest through the Trees: Data Leakage from Partial Transformer Gradients

Weijun Li, Qiongkai Xu, Mark Dras

TL;DR

This work addresses the privacy risk of data leakage in distributed training by showing that private text data can be reconstructed from partial Transformer gradients, not just full-model gradients. The authors formulate a gradient-matching attack that operates on intermediate Transformer layers or even individual linear components, and evaluate it on CoLA, SST-2, and Rotten Tomatoes across multiple BERT variants. They demonstrate that even a single layer or a small fraction of parameters (as low as 0.54%) can enable reconstruction, and that differential privacy offers only limited protection without severe degradation of model utility. The findings imply that existing privacy defenses in distributed learning are inadequate against partial-gradient leakage and motivate development of stronger, more scalable defenses such as encryption-based solutions or privacy-preserving communication mechanisms.

Abstract

Recent studies have shown that distributed machine learning is vulnerable to gradient inversion attacks, where private training data can be reconstructed by analyzing the gradients of the models shared in training. Previous attacks established that such reconstructions are possible using gradients from all parameters of the entire model. However, we hypothesize that most of the involved modules, or even their sub-modules, are at risk of training data leakage, and we validate such vulnerabilities in various intermediate layers of language models. Our extensive experiments reveal that gradients from a single Transformer layer, or even a single linear component with 0.54% of the parameters, are susceptible to training data leakage. Additionally, we show that applying differential privacy on gradients during training offers limited protection against the novel vulnerability of data disclosure.
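The idea of matching only part of the gradient can be illustrated with a toy sketch. This is not the authors' attack: it assumes a squared L2 matching distance, a hypothetical two-by-two linear layer with a squared-error loss, and a brute-force search over a few candidate inputs instead of optimization over token embeddings. All names (`toy_gradients`, `partial_gradient_distance`, `best_candidate`) are made up for illustration. The attacker observes only the weight gradient, not the bias gradient, and still singles out the private input.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Gradients = Dict[str, Vector]  # parameter name -> flattened gradient

# Hypothetical toy model: y = W x + b with loss 0.5 * ||y - TARGET||^2.
W = [[0.5, -1.0], [2.0, 0.25]]
B = [0.1, -0.2]
TARGET = [1.0, 0.0]


def toy_gradients(x: Sequence[float]) -> Gradients:
    """Analytic gradients of the toy loss with respect to W and b."""
    residual = [
        sum(w * xi for w, xi in zip(row, x)) + b - t
        for row, b, t in zip(W, B, TARGET)
    ]
    return {
        # dL/dW[i][j] = residual[i] * x[j], flattened row by row
        "weight": [r * xi for r in residual for xi in x],
        # dL/db[i] = residual[i]
        "bias": list(residual),
    }


def partial_gradient_distance(
    observed: Gradients, candidate: Gradients, selected: Sequence[str]
) -> float:
    """Squared L2 distance, summed over the selected parameters only."""
    total = 0.0
    for name in selected:
        obs, cand = observed[name], candidate[name]
        if len(obs) != len(cand):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        total += sum((o - c) ** 2 for o, c in zip(obs, cand))
    return total


def best_candidate(
    observed: Gradients, candidates: Sequence[Vector], selected: Sequence[str]
) -> Vector:
    """Return the candidate input whose partial gradients match best."""
    return min(
        candidates,
        key=lambda x: partial_gradient_distance(
            observed, toy_gradients(x), selected
        ),
    )


if __name__ == "__main__":
    private_input = [1.0, 2.0]
    # The attacker sees the weight gradient only (a partial gradient).
    observed = {"weight": toy_gradients(private_input)["weight"]}
    candidates = [[0.0, 1.0], [1.0, 2.0], [2.0, -1.0]]
    print(best_candidate(observed, candidates, ["weight"]))
```

The `selected` argument is the only thing that changes between a full-gradient and a partial-gradient comparison in this sketch: restricting it to one parameter group mimics an attacker who sees gradients from a single component.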

Paper Structure (24 sections, 4 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: To reconstruct training data, prior attacks (a) typically require access to gradients from the whole model, while our attack (b) uses partial model gradients.
  • Figure 2: Results across varying Transformer layers.
  • Figure 3: Results across varying Attention Modules.
  • Figure 4: Results across varying FFN Modules.
  • Figure 5: The comparison of reconstruction attacks using different gradient modules on CoLA dataset and BERT$_{\text{BASE}}$ model ($B = 1$).
  • ...and 2 more figures