Beyond Gradient and Priors in Privacy Attacks: Leveraging Pooler Layer Inputs of Language Models in Federated Learning

Jianwei Li, Sheng Liu, Qi Lei

TL;DR

The paper addresses privacy risks in language models trained under federated learning (FL) by showing that architectural components, especially the Pooler layer, can leak information beyond what gradients and priors reveal. It introduces a two-stage attack. The first stage analytically recovers the intermediate feature directions fed into the Pooler module. The second stage runs an optimization-based attack that combines gradient information, priors, and the recovered features to reconstruct training inputs. The approach consistently surpasses state-of-the-art baselines across multiple datasets and batch sizes, and shows that longer sequences and certain activation functions amplify leakage. The work highlights intrinsic privacy vulnerabilities in modern LM architectures and urges the community to treat architectural design as a core factor in privacy defenses for FL systems.

Abstract

Language models trained via federated learning (FL) demonstrate impressive capabilities in handling complex tasks while protecting user privacy. Recent studies indicate that leveraging gradient information and prior knowledge can potentially reveal training samples within the FL setting. However, these investigations have overlooked the potential privacy risks tied to the intrinsic architecture of the models. This paper presents a two-stage privacy attack strategy that targets vulnerabilities in the architecture of contemporary language models, significantly enhancing attack performance by first recovering certain feature directions as additional supervisory signals. Our comparative experiments demonstrate superior attack performance across various datasets and scenarios, highlighting the privacy leakage risk associated with the increasingly complex architectures of language models. We call for the community to recognize and address these potential privacy risks in designing large language models.

Paper Structure

This paper contains 30 sections, 14 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Schematic illustration of the two-stage privacy attack method proposed in this study. 1) The first stage involves the analytics-based reconstruction of the feature information associated with the specific Pooler layer in Transformer-based language models. 2) The second stage utilizes the reconstructed feature information, combined with gradient inversion and prior knowledge, to guide the recovery of training data. This figure highlights the approach of intermediate feature recovery while exposing the inherent privacy risks in contemporary language model architectures.
  • Figure 2: Cosine similarity between recovered features and ground truth of BERT$_{\text{BASE}}$ on SST-2 across varying dimensions (50 to 750 in steps of 50) and batch sizes (1, 2, 4).
  • Figure 3: Architecture overview of our proposed attack mechanism on language models. A$_1$: Subtle modification of architecture and strategic weight initialization. A$_2$: Two-layer-neural-network-based reconstruction. B: Continuous optimization with gradient inversion and feature match. C: Discrete optimization with gradient matching loss and perplexity from pre-trained language models.
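
The captions above mention two ingredients that can be illustrated concretely: the cosine similarity used to compare recovered features with ground truth (Figure 2), and a continuous objective that combines gradient matching with a feature-match term (Figure 3, step B). The following is a minimal sketch, not the authors' implementation. The function names, the squared-error form of the gradient term, the `1 - cosine` form of the feature term, and the `feat_weight` parameter are all assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two flattened arrays (the metric named in Figure 2)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def combined_loss(dummy_grads, true_grads, dummy_feat, recovered_feat, feat_weight=1.0):
    """Hypothetical stage-two objective: gradient matching plus feature matching.

    Assumption: gradients are compared with a summed squared error, and
    features are compared by direction only (1 - cosine similarity), since
    the first stage is described as recovering feature *directions*.
    """
    grad_term = sum(
        float(np.sum((np.asarray(d, dtype=float) - np.asarray(t, dtype=float)) ** 2))
        for d, t in zip(dummy_grads, true_grads)
    )
    feat_term = 1.0 - cosine_similarity(dummy_feat, recovered_feat)
    return grad_term + feat_weight * feat_term

# Usage: gradients differ, features are orthogonal.
loss = combined_loss(
    dummy_grads=[np.zeros(2)],
    true_grads=[np.ones(2)],
    dummy_feat=[1.0, 0.0],
    recovered_feat=[0.0, 1.0],
    feat_weight=0.5,
)
print(loss)
```

Comparing features by direction rather than magnitude means a candidate input is not penalized when its Pooler-layer input is a positive rescaling of the recovered feature, which fits a first stage that recovers directions only.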