Delving into Differentially Private Transformer

Youlong Ding, Xueyang Wu, Yining Meng, Yonggang Luo, Hao Wang, Weike Pan

TL;DR

This work tackles the challenge of training Transformer models under differential privacy (DP) by proposing a modular reduction to the better-understood problem of training DP vanilla neural nets. It identifies two DP-specific bottlenecks: attention distraction caused by DP noise, and embedding sharing, which complicates efficient gradient clipping. To address these, the authors introduce Phantom Clipping, which enables efficient DP-SGD with embedding sharing, and the Re-Attention Mechanism, which tracks effective noise variance via Bayesian-style propagation to yield unbiased attention. Empirical results on short- and long-tailed recommendation tasks show that these methods improve training stability and the privacy-utility trade-off, particularly under stricter privacy budgets and imbalanced data, highlighting the practical potential of modular DP learning for Transformers.

Abstract

Deep learning with differential privacy (DP) has garnered significant attention over the past years, leading to the development of numerous methods aimed at enhancing model accuracy and training efficiency. This paper delves into the problem of training Transformer models with differential privacy. Our treatment is modular: the logic is to "reduce" the problem of training DP Transformers to the more basic problem of training DP vanilla neural nets. The latter is better understood and amenable to many model-agnostic methods. Such a "reduction" is done by first identifying the hardness unique to DP Transformer training: the attention distraction phenomenon and a lack of compatibility with existing techniques for efficient gradient clipping. To deal with these two issues, we propose the Re-Attention Mechanism and Phantom Clipping, respectively. We believe that our work not only casts new light on training DP Transformers but also promotes a modular treatment to advance research in the field of differentially private deep learning.
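
For orientation, the gradient clipping mentioned above refers to the per-example clipping step of standard DP-SGD. A minimal sketch of one such step in plain Python follows. It illustrates the baseline procedure only, not Phantom Clipping or the Re-Attention Mechanism. The names `clip`, `dp_sgd_step`, `max_norm`, and `noise_multiplier` are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def clip(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum the clipped
    gradients, add Gaussian noise, then average and take a gradient step."""
    batch_size = len(per_example_grads)
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation is proportional to the clipping threshold.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]

# Hypothetical toy batch: two examples, two parameters.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
new_params = dp_sgd_step([0.0, 0.0], grads, lr=0.1, max_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
```

The cost of this baseline comes from materializing one gradient per example before clipping, which is the step that efficient clipping techniques aim to avoid.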

Paper Structure

This paper contains 27 sections, 2 theorems, 37 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Lemma 3.1

Let $X$ and $Y$ be two independent random variables, and let $Z = XY$. Then the variance of $Z$ can be expressed as
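
The formula itself is not reproduced here. For independent $X$ and $Y$, a standard identity is $\mathrm{Var}(XY) = \mathrm{Var}(X)\mathrm{Var}(Y) + \mathrm{Var}(X)\,E[Y]^2 + \mathrm{Var}(Y)\,E[X]^2$, which is equivalent to $E[X^2]E[Y^2] - (E[X]E[Y])^2$; whether the lemma is stated in exactly this form is an assumption. The sketch below checks the identity with exact rational arithmetic on two small, hypothetical discrete distributions that are not from the paper.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    """Expectation of a finite distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())

def variance(dist):
    """Variance computed as E[V^2] - E[V]^2."""
    m = mean(dist)
    return sum(v * v * p for v, p in dist.items()) - m * m

def product_dist(dx, dy):
    """Distribution of Z = X * Y for independent X ~ dx and Y ~ dy."""
    dz = {}
    for (x, px), (y, py) in product(dx.items(), dy.items()):
        dz[x * y] = dz.get(x * y, Fraction(0)) + px * py
    return dz

# Hypothetical example distributions (illustrative only).
X = {Fraction(-1): Fraction(1, 4), Fraction(2): Fraction(3, 4)}
Y = {Fraction(0): Fraction(1, 3), Fraction(3): Fraction(2, 3)}

# Left side: variance of the product, computed directly from its distribution.
lhs = variance(product_dist(X, Y))
# Right side: the identity expressed through the moments of X and Y.
rhs = (variance(X) * variance(Y)
       + variance(X) * mean(Y) ** 2
       + variance(Y) * mean(X) ** 2)
```

Using `Fraction` keeps both sides exact, so `lhs == rhs` holds without a floating-point tolerance.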

Figures (11)

  • Figure 1: The modular treatment in this work, where the focus of this work is on the first 'reduction'.
  • Figure 2: Phantom Clipping, illustrated.
  • Figure 3: Re-Attention Mechanism, illustrated.
  • Figure 4: Each run is repeated five times with independent random seeds, with test accuracy (i.e., NDCG@10(%) and HIT@10(%)) reported every five epochs. The graduated shading (best viewed zoomed in) represents confidence intervals from 60% to 100%.
  • Figure 5: GELU activation and ReLU activation.
  • ...and 6 more figures

Theorems & Definitions (11)

  • Definition 2.1
  • Definition 4.1
  • Claim 4.2
  • Claim 2.1
  • proof
  • proof
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • ...and 1 more