DP-NMT: Scalable Differentially-Private Machine Translation

Timour Igamberdiev, Doan Nam Long Vu, Felix Künnecke, Zhuo Yu, Jannik Holmer, Ivan Habernal

TL;DR

DP-NMT is an open-source, JAX/Flax-based framework for training neural machine translation (NMT) models with differential privacy via DP-SGD. It spells out DP-SGD implementation details and examines privacy amplification under Poisson sampling versus random shuffling, enabling scalable experimentation on both standard and privacy-focused datasets. The paper demonstrates the framework on multiple NMT datasets, reporting privacy/utility trade-offs and showing how dataset size and sampling method influence performance under a fixed privacy budget. The result is a practical, reproducible platform for privacy-preserving NMT research; the authors invite community feedback to broaden model support and dataset coverage.
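To make the contrast between Poisson sampling and random shuffling concrete, here is a minimal sketch in plain Python. The function names `shuffled_batches` and `poisson_batches` and the toy dataset size are assumptions chosen for illustration; they are not taken from the DP-NMT code base. The sketch only shows how batch indices are drawn under each scheme, not how privacy is accounted for.

```python
import random

def shuffled_batches(n_examples, batch_size, rng):
    """Random shuffling: permute all indices once, then cut fixed-size batches.

    Every example appears exactly once per epoch.
    """
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]

def poisson_batches(n_examples, expected_batch_size, n_steps, rng):
    """Poisson sampling: at every step, include each example independently
    with probability q = expected_batch_size / n_examples.

    Batch sizes vary from step to step, and an example may appear in
    several batches or in none.
    """
    q = expected_batch_size / n_examples
    return [
        [i for i in range(n_examples) if rng.random() < q]
        for _ in range(n_steps)
    ]

# Hypothetical toy setting: 1000 examples, (expected) batch size 100.
rng = random.Random(0)
shuffled = shuffled_batches(n_examples=1000, batch_size=100, rng=rng)
poisson = poisson_batches(n_examples=1000, expected_batch_size=100, n_steps=10, rng=rng)

shuffled_sizes = [len(batch) for batch in shuffled]
poisson_sizes = [len(batch) for batch in poisson]
print("shuffled batch sizes:", shuffled_sizes)
print("poisson batch sizes: ", poisson_sizes)
```

The shuffled batches all have the same size and together cover the dataset exactly once, while the Poisson batches only match the target size in expectation. Commonly used subsampled-Gaussian privacy analyses assume the Poisson scheme, which is one reason the choice of sampling method matters when reporting a privacy budget.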

Abstract

Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
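As a rough illustration of the DP-SGD details the abstract refers to, the sketch below shows the standard gradient aggregation step: clip each per-example gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise scaled by the clipping norm, and divide by the batch size. It is written in plain Python over lists for readability and is not the DP-NMT implementation; the names `clip_gradient` and `dp_sgd_gradient`, the explicit `dim` argument, and the toy gradients are assumptions made for this example.

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Scale one per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_example_grads, dim, clip_norm, noise_multiplier, batch_size, rng):
    """Return a noisy average gradient from per-example gradients.

    batch_size is passed explicitly: with Poisson sampling, the expected
    batch size is typically used rather than the realized one (assumption
    of this sketch).
    """
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    # Gaussian noise with standard deviation noise_multiplier * clip_norm
    # is added independently to every coordinate of the summed gradient.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / batch_size for x in noisy]

# Toy per-example gradients (hypothetical values).
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)

# With noise_multiplier=0 the result is the plain average of clipped gradients.
noiseless = dp_sgd_gradient(grads, dim=2, clip_norm=1.0, noise_multiplier=0.0,
                            batch_size=2, rng=rng)
noisy = dp_sgd_gradient(grads, dim=2, clip_norm=1.0, noise_multiplier=1.0,
                        batch_size=2, rng=rng)
print("noiseless:", noiseless)
print("noisy:    ", noisy)
```

In the noiseless call, the first gradient (norm 5) is scaled down to norm 1 while the second (norm 0.5) is left unchanged, so clipping bounds how much any single example can influence the update. In a JAX setting the per-example gradients would typically be computed with vectorized differentiation rather than a Python loop.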

Paper Structure

This paper contains 22 sections, 3 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Framework Pipeline. Similar components are represented with different colors. Green: Dataset selection. Blue: Experimental configurations (including privacy settings). Grey: Dataset preparation. Orange: Model-specific elements. Red: Model training. Purple: Model inference. Yellow: Output of experiments.
  • Figure 2: Test BLEU scores for each of the three datasets using varying privacy budgets, comparing the random shuffling and Poisson sampling methods to iterate over the dataset. Non-private results are additionally shown for each dataset ($\varepsilon = \infty$) with random shuffling. Lower $\varepsilon$ corresponds to a stronger privacy guarantee.