DP-SGD Without Clipping: The Lipschitz Neural Network Way

Louis Bethune, Thomas Massena, Thibaut Boissin, Yannick Prudent, Corentin Friedrich, Franck Mamalet, Aurelien Bellet, Mathieu Serrurier, David Vigouroux

TL;DR

By bounding the Lipschitz constant of each layer with respect to its parameters, the authors prove that Lipschitz-constrained networks can be trained with differential privacy guarantees, without per-sample gradient clipping.

Abstract

State-of-the-art approaches for training Differentially Private (DP) Deep Neural Networks (DNNs) have difficulty estimating tight bounds on the sensitivity of the network's layers, and instead rely on per-sample gradient clipping. This clipping process not only biases the direction of the gradients but is also costly in both memory and computation. To provide sensitivity bounds and bypass the drawbacks of clipping, we propose to rely on Lipschitz-constrained networks. Our theoretical analysis reveals an unexplored link between the Lipschitz constant of these networks with respect to their input and the one with respect to their parameters. By bounding the Lipschitz constant of each layer with respect to its parameters, we prove that we can train these networks with privacy guarantees. Our analysis not only allows the computation of the aforementioned sensitivities at scale, but also provides guidance on how to maximize the gradient-to-noise ratio for fixed privacy guarantees. The code has been released as a Python package available at https://github.com/Algue-Rythme/lip-dp
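
To make the contrast drawn in the abstract concrete, the following minimal Python sketch compares the aggregation step of clipping-based DP-SGD with a clipless variant in which a bound on every per-sample gradient norm is assumed to be known in advance. The function names, the `sensitivity_bound` parameter, and the choice to scale the noise by that bound are illustrative assumptions; this is neither the API of the lip-dp package nor the paper's exact algorithm.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def clipped_noisy_mean(per_sample_grads, clip_norm, sigma, rng):
    """Clipping-based aggregation: rescale each per-sample gradient to norm
    at most clip_norm, sum, add Gaussian noise, then average.
    Requires materialising one gradient per sample."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        scale = min(1.0, clip_norm / max(l2_norm(g), 1e-12))
        for i in range(dim):
            total[i] += scale * g[i]
    noisy = [t + rng.gauss(0.0, sigma * clip_norm) for t in total]
    return [x / len(per_sample_grads) for x in noisy]


def clipless_noisy_mean(batch_grad_sum, batch_size, sensitivity_bound, sigma, rng):
    """Clipless aggregation (assumption): every per-sample gradient norm is
    already bounded by sensitivity_bound, so only the summed batch gradient
    from an ordinary backward pass is needed and no rescaling is applied."""
    noisy = [t + rng.gauss(0.0, sigma * sensitivity_bound) for t in batch_grad_sum]
    return [x / batch_size for x in noisy]
```

In the clipped version the gradient direction can change whenever a sample is rescaled, which is the bias the abstract refers to; in the clipless version the summed gradient is left untouched and only noise is added.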

Paper Structure (70 sections, 11 theorems, 83 equations, 19 figures, 8 tables, 4 algorithms)

Key Result

Proposition 1

Assume that the loss fulfills $\|\nabla_{\theta}\mathcal{L}(\hat{y},y)\|_2\leq l$, and assume that the network is trained on a dataset of size $N$ with the SGD algorithm for $T$ steps with noise scale $\mathcal{N}(\mathbf{0}, \sigma^2)$ such that: [...] Then the SGD training of the network is $(\epsilon,\delta)$-DP.

Figures (19)

  • Figure 1: A usage example of our framework, illustrating how to create a small Lipschitz VGG and how to train it under $(\epsilon,\delta)$-DP guarantees while reporting $(\epsilon, \delta)$ values.
  • Figure 2: Backpropagation for bounds (Algorithm \ref{alg:gnbc}) computes the per-layer sensitivity $\Delta_d$.
  • Figure 3: Comparison of the utility of DP-SGD and Clipless DP-SGD on tabular and image data.
  • Figure 4: Privacy/accuracy/robustness trade-off on CIFAR-10: We report the Pareto front of robustness certificates at different radii $r$ for Lipschitz-constrained networks, while unconstrained networks cannot produce robustness certificates. Models are trained in an "out-of-the-box" setting: no pre-training, no data augmentation and no handcrafted features. To ensure a fair comparison between algorithms, we perform 30 repetitions with a Bayesian optimizer to select the best hyper-parameters.
  • Figure 5: Our approach outperforms competing frameworks in terms of runtime and memory: we trained CNNs (ranging from 130K to 2M parameters) on CIFAR-10, and report the median batch processing time (including noise addition, and either the application of constraints $\Pi$ or gradient clipping).
  • ...and 14 more figures
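
The per-layer sensitivity mentioned in Figure 2 can be illustrated on a single bias-free dense layer $y = Wx$. This is a hedged sketch, not the paper's algorithm: for such a layer the gradient of the loss with respect to $W$ is the outer product $g x^\top$ with $g = \partial\mathcal{L}/\partial y$, so its Frobenius norm equals $\|g\|_2 \|x\|_2$. If the input norm and the incoming gradient norm are both bounded, their product bounds the per-sample parameter gradient. The helper names below are hypothetical.

```python
import math


def dense_param_grad(x, g):
    """Gradient of the loss w.r.t. W for y = W x, given g = dL/dy.
    It is the outer product g x^T, returned as a list of rows."""
    return [[gi * xj for xj in x] for gi in g]


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def dense_sensitivity_bound(input_norm_bound, output_grad_norm_bound):
    """Upper bound on the per-sample gradient norm of a dense layer,
    assuming ||x||_2 <= input_norm_bound and ||g||_2 <= output_grad_norm_bound."""
    return input_norm_bound * output_grad_norm_bound


x = [3.0, 4.0]        # input with ||x||_2 = 5
g = [1.0, 2.0, 2.0]   # output gradient with ||g||_2 = 3
grad_norm = frobenius_norm(dense_param_grad(x, g))
bound = dense_sensitivity_bound(5.0, 3.0)
```

In a deeper network, the bound on $\|g\|_2$ at a given layer would itself follow from the bound $l$ on the loss gradient and the Lipschitz constants of the layers that come after it, which is where the link between input-Lipschitz and parameter-Lipschitz constants enters.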

Theorems & Definitions (32)

  • Definition 1: $(\epsilon,\delta)$-Approximate Differential Privacy
  • Definition 2: $l_2$-sensitivity
  • Definition 3: Lipschitz feed-forward neural network
  • Remark 1: Tighter bounds in literature
  • Remark 2: GNP networks limitations
  • Definition 4: Feedforward neural network
  • Definition 5: Lipschitz constant
  • Definition 6: Lipschitz neural network
  • Definition 7: Gradient Norm Preserving Networks
  • Definition 8: Neighboring datasets
  • ...and 22 more
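
For orientation, the first definitions in this list correspond to standard notions. The statements below are the usual textbook forms, given as a reminder under the assumption that the paper follows them; its exact wording and notation may differ. The symbols $\mathcal{A}$, $D$, $D'$, $S$, $f$ and $K$ are introduced here for illustration.

A randomized mechanism $\mathcal{A}$ is $(\epsilon,\delta)$-DP if, for all neighboring datasets $D, D'$ and all measurable sets of outputs $S$,

$$\Pr[\mathcal{A}(D)\in S] \leq e^{\epsilon}\,\Pr[\mathcal{A}(D')\in S] + \delta.$$

The $l_2$-sensitivity of a query $f$ is

$$\Delta(f) = \sup_{D, D' \text{ neighboring}} \|f(D)-f(D')\|_2,$$

and a function $f$ is $K$-Lipschitz (for the $l_2$ norm) if, for all $x, x'$,

$$\|f(x)-f(x')\|_2 \leq K\,\|x-x'\|_2.$$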