Generative Models for Effective ML on Private, Decentralized Datasets

Sean Augenstein, H. Brendan McMahan, Daniel Ramage, Swaroop Ramaswamy, Peter Kairouz, Mingqing Chen, Rajiv Mathews, Blaise Aguera y Arcas

TL;DR

The paper addresses how to debug ML systems when the data cannot be inspected, whether for privacy reasons or because it is decentralized, and proposes generative models trained with federated learning and differential privacy as stand-ins for manual data inspection. It applies DP-FedAvg with user-level DP to RNN language models on text and introduces a DP-FedAvg-GAN algorithm for images, producing synthetic data faithful enough to expose data issues while preserving privacy. The authors demonstrate DP federated RNNs that reveal tokenization bugs and out-of-vocabulary (OOV) word patterns, and DP federated GANs that detect a pixel-inversion bug in on-device image preprocessing; they also discuss privacy budgets and real-world scaling. The work points to a new class of tools for model debugging, labeling, and bias detection in private, decentralized settings, and provides open-source resources to support further development.

Abstract

To improve real-world applications of machine learning, experienced modelers develop intuition about their datasets, their models, and how the two interact. Manual inspection of raw data - of representative samples, of outliers, of misclassifications - is an essential tool in a) identifying and fixing problems in the data, b) generating new modeling hypotheses, and c) assigning or refining human-provided labels. However, manual data inspection is problematic for privacy sensitive datasets, such as those representing the behavior of real-world individuals. Furthermore, manual data inspection is impossible in the increasingly important setting of federated learning, where raw examples are stored at the edge and the modeler may only access aggregated outputs such as metrics or model parameters. This paper demonstrates that generative models - trained using federated methods and with formal differential privacy guarantees - can be used effectively to debug many commonly occurring data issues even when the data cannot be directly inspected. We explore these methods in applications to text with differentially private federated RNNs and to images using a novel algorithm for differentially private federated GANs.
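The training recipe the abstract refers to (federated averaging with a user-level differential privacy guarantee) can be illustrated with a small sketch. This is not the authors' implementation: the function names `clip_update` and `dp_fedavg_round`, the parameters `clip_norm`, `noise_multiplier`, and `server_lr`, and the use of plain Python lists for model weights are all assumptions made for illustration. It assumes a fixed number of users per round, clips each user's model update to a bounded L2 norm, sums the clipped updates, adds Gaussian noise scaled to the clip norm, and averages.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale one user's update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_fedavg_round(weights, user_updates, clip_norm, noise_multiplier,
                    server_lr=1.0, rng=None):
    """One illustrative server round: clip, sum, add noise, average, apply.

    Hypothetical helper, not an API from the paper or any library.
    """
    rng = rng or random.Random()
    num_users = len(user_updates)

    # Bound each user's contribution so one user can move the sum
    # by at most clip_norm in L2 norm.
    clipped = [clip_update(u, clip_norm) for u in user_updates]

    # Coordinate-wise sum of the clipped updates.
    summed = [sum(column) for column in zip(*clipped)]

    # Gaussian noise with standard deviation proportional to the clip norm,
    # added to the sum before dividing by the (fixed) number of users.
    sigma = noise_multiplier * clip_norm
    noised_mean = [(s + rng.gauss(0.0, sigma)) / num_users for s in summed]

    # Apply the noised average update to the global model.
    return [w + server_lr * d for w, d in zip(weights, noised_mean)]


if __name__ == "__main__":
    global_weights = [0.0, 0.0]
    updates = [[3.0, 4.0], [0.3, 0.4]]  # toy per-user updates
    new_weights = dp_fedavg_round(
        global_weights, updates, clip_norm=1.0, noise_multiplier=1.0,
        rng=random.Random(0),
    )
    print(new_weights)
```

The sketch omits everything that makes the guarantee formal in practice, such as random user sampling and privacy accounting across rounds; it is meant only to show why clipping and noise are applied per user rather than per example.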

Paper Structure

This paper contains 46 sections, 1 equation, 10 figures, 13 tables, 2 algorithms.

Figures (10)

  • Figure 1: Percentage of samples generated from the word-LM that are OOV by position in the sentence, with and without bug.
  • Figure 1: DP-FedAvg-GAN, based on DP-FedAvg (App. \ref{sec.ExpDetailsGAN}) but accounting for training both GAN models.
  • Figure 2: Examples of primary model CNN input, from EMNIST with letters and digits (62 classes).
  • Figure 2: DP-FedAvg with fixed-size federated rounds, used to train word- and char-LMs in Section \ref{sec.ExpRNNLM}.
  • Figure 3: DP federated GAN generator output given an inversion bug on 50% of devices.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Definition 1