DP-NMT: Scalable Differentially-Private Machine Translation
Timour Igamberdiev, Doan Nam Long Vu, Felix Künnecke, Zhuo Yu, Jannik Holmer, Ivan Habernal
TL;DR
DP-NMT is an open-source, JAX/Flax-based framework for training neural machine translation (NMT) models with differential privacy via DP-SGD. It makes the implementation details of DP-SGD explicit and examines privacy amplification under Poisson sampling versus random shuffling, enabling scalable experimentation on both standard and privacy-focused datasets. The paper demonstrates the framework on multiple NMT datasets, reporting privacy/utility trade-offs and showing how dataset size and sampling method influence performance under a fixed privacy budget. The result is a practical, reproducible platform for privacy-preserving NMT research; the authors invite community feedback to broaden model support and dataset coverage.
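To make the distinction between the two sampling schemes concrete, here is a minimal sketch in plain Python. It is not taken from the DP-NMT code base: the function names `poisson_batches` and `shuffled_batches` and all sizes are hypothetical, chosen only for illustration. Under Poisson sampling each example joins each batch independently with a fixed probability, so batch sizes vary; under random shuffling the data is permuted once per epoch and cut into fixed-size batches.

```python
import random

def poisson_batches(n, rate, steps, rng):
    """Poisson sampling: each example joins each batch independently with probability `rate`."""
    return [[i for i in range(n) if rng.random() < rate] for _ in range(steps)]

def shuffled_batches(n, batch_size, rng):
    """Random shuffling: permute the indices once, then cut them into fixed-size batches."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, n, batch_size)]

rng = random.Random(0)
n, batch_size = 1000, 100  # hypothetical dataset and batch sizes

poisson = poisson_batches(n, rate=batch_size / n, steps=n // batch_size, rng=rng)
shuffled = shuffled_batches(n, batch_size, rng)

print([len(b) for b in poisson])   # batch sizes are random, with expected value 100
print([len(b) for b in shuffled])  # every batch has exactly 100 examples
```

In the shuffled case every example appears exactly once per epoch, whereas under Poisson sampling an example may appear in several batches or in none. This difference is why the two schemes call for different privacy amplification analyses.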
Abstract
Neural machine translation (NMT) is a popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always made clear in existing work, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
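For readers unfamiliar with DP-SGD, the core update can be sketched in a few lines of JAX: compute per-example gradients, clip each one to a fixed L2 norm, sum them, add Gaussian noise scaled to the clipping norm, and average. The sketch below is an illustration under stated assumptions, not the DP-NMT implementation. It uses a toy linear model in place of an NMT model, the names `loss_fn` and `dp_sgd_step` and all hyperparameter values are hypothetical, and privacy accounting is omitted entirely.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    """Squared error of a toy linear model on ONE example (stand-in for an NMT loss)."""
    pred = jnp.dot(x, params["w"]) + params["b"]
    return (pred - y) ** 2

def dp_sgd_step(params, xs, ys, key, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update; returns the new parameters and the pre-clipping gradient norms."""
    batch = xs.shape[0]
    # Per-example gradients: vmap over the batch axis of xs and ys, not over params.
    grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(params, xs, ys)
    leaves, treedef = jax.tree_util.tree_flatten(grads)
    # Global L2 norm of each example's gradient, taken across all parameters.
    sq_norms = sum(jnp.sum(g.reshape(batch, -1) ** 2, axis=1) for g in leaves)
    norms = jnp.sqrt(sq_norms)
    # Scale factor that shrinks any gradient whose norm exceeds clip_norm.
    scale = jnp.minimum(1.0, clip_norm / (norms + 1e-12))
    keys = jax.random.split(key, len(leaves))
    noisy_leaves = []
    for g, k in zip(leaves, keys):
        clipped = g * scale.reshape((batch,) + (1,) * (g.ndim - 1))
        summed = clipped.sum(axis=0)
        # Gaussian noise with standard deviation noise_multiplier * clip_norm.
        noise = noise_multiplier * clip_norm * jax.random.normal(k, summed.shape)
        # Assumption: normalise by the actual batch size; with Poisson sampling
        # one would typically divide by the expected batch size instead.
        noisy_leaves.append((summed + noise) / batch)
    noisy_grads = jax.tree_util.tree_unflatten(treedef, noisy_leaves)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, noisy_grads)
    return new_params, norms

# Toy usage with made-up data.
params = {"w": jnp.zeros(3), "b": jnp.array(0.0)}
xs = jnp.ones((4, 3)) * 10.0
ys = jnp.ones(4) * 5.0
new_params, norms = dp_sgd_step(params, xs, ys, jax.random.PRNGKey(0))
print(norms)  # per-example gradient norms before clipping
```

Clipping bounds the influence any single training example can have on the update, and the noise is calibrated to that bound. Turning the noise multiplier, sampling rate, and number of steps into an (ε, δ) guarantee requires a separate privacy accountant, which this sketch does not include.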
