The Data Minimization Principle in Machine Learning
Prakhar Ganesh, Cuong Tran, Reza Shokri, Ferdinando Fioretto
TL;DR
The paper addresses the lack of a rigorous mathematical formulation for data minimization in ML by proposing an optimization-based framework grounded in legal definitions. It introduces a structured set of threat models (re-identification, reconstruction, and membership inference) and concrete privacy metrics (RIR, RCR, MIR), and uses them to evaluate minimization methods, including baselines such as feature selection and personalized subsampling. The evaluation shows that raw minimization (e.g., reducing dataset size) does not necessarily translate into actual privacy benefits, and it highlights personalization and multi-faceted privacy risks as crucial considerations. The framework and analyses offer a practical pathway for designing privacy-preserving ML systems that comply with data protection regulations while balancing utility, and they motivate further research into personalized data minimization and adversarial privacy auditing.
Abstract
The principle of data minimization aims to reduce the amount of data collected, processed, or retained in order to limit the potential for misuse, unauthorized access, or data breaches. Rooted in privacy-by-design principles, data minimization has been endorsed by various global data protection regulations. However, its practical implementation remains a challenge due to the lack of a rigorous formulation. This paper addresses this gap and introduces an optimization framework for data minimization based on its legal definitions. It then adapts several optimization algorithms to perform data minimization and conducts a comprehensive evaluation of their compliance with minimization objectives as well as their impact on user privacy. Our analysis underscores the mismatch between the privacy expectations of data minimization and the actual privacy benefits, emphasizing the need for approaches that account for multiple facets of real-world privacy risks.
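
To make the idea of minimization as constrained optimization more concrete, the sketch below shows one way a feature-selection baseline could be framed: retain as few features as possible while keeping the loss within a tolerance of the model trained on the full data. This is an illustration under stated assumptions, not the paper's algorithm. The function names (`fit_linear`, `mse`, `minimize_features`), the tolerance parameter `alpha`, the linear model, the greedy backward-elimination search, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np

def fit_linear(X, y):
    """Fit a least-squares linear model with a bias term."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def mse(w, X, y):
    """Mean squared error of the linear model w on (X, y)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((A @ w - y) ** 2))

def minimize_features(X, y, alpha):
    """Greedy backward elimination (an assumed, simple search strategy).

    Repeatedly drops the feature whose removal hurts the loss least,
    and stops once any further removal would raise the loss by more
    than alpha relative to the model trained on all features.
    Returns a boolean mask of the features to keep.
    """
    keep = np.ones(X.shape[1], dtype=bool)
    base_loss = mse(fit_linear(X, y), X, y)
    while keep.any():
        candidates = []
        for j in np.flatnonzero(keep):
            trial = keep.copy()
            trial[j] = False
            Xm = X[:, trial]  # data that would remain after dropping feature j
            candidates.append((mse(fit_linear(Xm, y), Xm, y), j))
        best_loss, best_j = min(candidates)
        if best_loss - base_loss > alpha:
            break  # utility constraint would be violated
        keep[best_j] = False
    return keep

# Synthetic data: only the first two of five features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0

mask = minimize_features(X, y, alpha=1e-6)
print(mask)
```

Note that a mask like this only measures how much data is retained. As the abstract stresses, a smaller dataset does not by itself guarantee lower privacy risk, so any such procedure would still need to be audited against the threat models separately.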
