DP-NMT: Scalable Differentially-Private Machine Translation

Timour Igamberdiev; Doan Nam Long Vu; Felix Künnecke; Zhuo Yu; Jannik Holmer; Ivan Habernal

TL;DR

DP-NMT addresses privacy concerns in neural machine translation by introducing an open-source, JAX/Flax-based framework for training NMT models with differential privacy via DP-SGD. It clarifies DP-SGD implementation details and examines privacy amplification under Poisson sampling versus random shuffling, enabling scalable experimentation on standard and privacy-focused datasets. The paper demonstrates the framework on multiple NMT datasets, reporting privacy/utility trade-offs and highlighting how dataset size and sampling method influence performance under fixed budgets. This work provides a practical, reproducible platform for advancing privacy-preserving NMT research and invites community feedback to broaden model support and dataset coverage.
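The difference between the two ways of iterating over a dataset can be made concrete with a small sketch. It is written in plain Python rather than JAX so that it runs without extra dependencies, and it is not taken from the DP-NMT code base: the function names `shuffled_batches` and `poisson_batches`, and all parameter values, are hypothetical. Random shuffling yields fixed-size batches that cover every example once per epoch, while Poisson sampling includes each example independently at every step, so batch sizes vary and a batch may even be empty.

```python
import random


def shuffled_batches(n_examples, batch_size, rng):
    """Random shuffling: permute the indices once, then cut fixed-size batches."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]


def poisson_batches(n_examples, expected_batch_size, n_steps, rng):
    """Poisson sampling: at each step, keep every example independently
    with probability q = expected_batch_size / n_examples."""
    q = expected_batch_size / n_examples
    return [
        [i for i in range(n_examples) if rng.random() < q]
        for _ in range(n_steps)
    ]


# Illustrative values only (assumptions, not settings from the paper).
rng = random.Random(0)
shuffled = shuffled_batches(n_examples=1000, batch_size=100, rng=rng)
poisson = poisson_batches(n_examples=1000, expected_batch_size=100, n_steps=10, rng=rng)

print([len(b) for b in shuffled])  # every batch has the same size
print([len(b) for b in poisson])   # sizes fluctuate around the expected value
```

Standard privacy accounting for DP-SGD generally assumes Poisson sampling, which is one reason the choice of iteration scheme matters when reporting a privacy budget.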

Abstract

Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
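For readers unfamiliar with DP-SGD, the core update can be sketched in a few lines: clip each per-example gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and take an averaged gradient step. The sketch below uses plain Python lists instead of JAX arrays and is not the framework's implementation; `clip_to_norm`, `dp_sgd_update`, and every parameter name are hypothetical. Dividing by the expected batch size rather than the realized one is a common convention under Poisson sampling and is an assumption here.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Rescale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_update(params, per_example_grads, lr, max_norm,
                  noise_multiplier, expected_batch_size, rng):
    """One DP-SGD step on a flat parameter vector (illustrative only)."""
    # 1. Clip every per-example gradient individually.
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    # 2. Sum the clipped gradients (an empty batch contributes zeros).
    summed = [0.0] * len(params)
    for grad in clipped:
        summed = [s + g for s, g in zip(summed, grad)]
    # 3. Add Gaussian noise with standard deviation noise_multiplier * max_norm.
    std = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    # 4. Average over the expected batch size and take a gradient step.
    return [p - lr * g / expected_batch_size for p, g in zip(params, noisy)]


# Toy usage with made-up gradients; noise_multiplier=0.0 disables the noise
# so that the effect of clipping is easy to check by hand.
rng = random.Random(0)
new_params = dp_sgd_update(
    params=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=1.0, max_norm=1.0, noise_multiplier=0.0,
    expected_batch_size=2, rng=rng,
)
```

In a real training run the noise multiplier is positive and is chosen together with the sampling rate and the number of steps to meet a target privacy budget.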

Paper Structure (22 sections, 3 equations, 2 figures, 3 tables, 1 algorithm)

Introduction
DP-SGD and subsampling
Related work
Applications of DP-SGD to NLP
Private neural machine translation
Description of software
Accelerated DP-SGD with JAX and Flax
Model training and inference
Integrating DPDataloader from Opacus
Engineering challenges for LLMs
Experiments
Datasets
Experimental setup
Results and Discussion
Privacy/utility trade-off
...and 7 more sections

Figures (2)

Figure 1: Framework Pipeline. Similar components are represented with different colors. Green: Dataset selection. Blue: Experimental configurations (including privacy settings). Grey: Dataset preparation. Orange: Model-specific elements. Red: Model training. Purple: Model inference. Yellow: Output of experiments.
Figure 2: Test BLEU scores for each of the three datasets using varying privacy budgets, comparing the random shuffling and Poisson sampling methods to iterate over the dataset. Non-private results are additionally shown for each dataset ($\varepsilon = \infty$) with random shuffling. Lower $\varepsilon$ corresponds to a stronger privacy guarantee.
