The Data Minimization Principle in Machine Learning

Prakhar Ganesh; Cuong Tran; Reza Shokri; Ferdinando Fioretto

TL;DR

The paper addresses the lack of a rigorous mathematical formulation of data minimization in ML by proposing an optimization-based framework grounded in the principle's legal definitions. It defines three threat models (re-identification, reconstruction, and membership inference) with corresponding privacy metrics: Re-identification Risk (RIR), Reconstruction Risk (RCR), and Membership Inference Risk (MIR). These are used to evaluate minimization methods, including baselines such as feature selection and personalized subsampling. The evaluation shows that reducing the amount of data (e.g., dataset size) does not necessarily yield a matching privacy benefit, and it highlights personalization and multi-faceted privacy risks as crucial considerations. The framework and analyses offer a practical pathway toward privacy-preserving ML systems that comply with data protection regulations while preserving utility, and they motivate further research into personalized data minimization and adversarial privacy auditing.

Abstract

The principle of data minimization aims to reduce the amount of data collected, processed or retained to minimize the potential for misuse, unauthorized access, or data breaches. Rooted in privacy-by-design principles, data minimization has been endorsed by various global data protection regulations. However, its practical implementation remains a challenge due to the lack of a rigorous formulation. This paper addresses this gap and introduces an optimization framework for data minimization based on its legal definitions. It then adapts several optimization algorithms to perform data minimization and conducts a comprehensive evaluation in terms of their compliance with minimization objectives as well as their impact on user privacy. Our analysis underscores the mismatch between the privacy expectations of data minimization and the actual privacy benefits, emphasizing the need for approaches that account for multiple facets of real-world privacy risks.
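
As a rough illustration of what "minimization as optimization" can look like, the sketch below greedily drops features as long as a utility measure stays within a tolerance of its baseline. This is not the paper's formulation or any of its algorithms: the nearest-centroid model, the use of training accuracy as the utility, the tolerance `alpha`, and the function names `centroid_accuracy` and `minimize_features` are all assumptions made for the example.

```python
from typing import List, Sequence


def centroid_accuracy(X: Sequence[Sequence[float]], y: Sequence[int],
                      cols: Sequence[int]) -> float:
    """Training accuracy of a nearest-centroid classifier using only `cols`."""
    labels = sorted(set(y))
    # Per-class mean of each retained feature.
    centroids = {}
    for c in labels:
        rows = [x for x, t in zip(X, y) if t == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X, y):
        # Squared Euclidean distance to each class centroid; ties go to the
        # smallest label because min() returns the first minimum.
        pred = min(
            labels,
            key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += int(pred == t)
    return correct / len(y)


def minimize_features(X: Sequence[Sequence[float]], y: Sequence[int],
                      alpha: float) -> List[int]:
    """Greedily drop features while accuracy stays within `alpha` of baseline.

    Hypothetical helper for illustration only; returns retained column indices.
    """
    kept = list(range(len(X[0])))
    baseline = centroid_accuracy(X, y, kept)
    for j in list(kept):
        trial = [k for k in kept if k != j]
        if centroid_accuracy(X, y, trial) >= baseline - alpha:
            kept = trial  # feature j is not needed for the stated utility
    return kept


# Toy data: only column 0 carries information about the label.
X_demo = [
    [-1, 1, 1], [-1, -1, 1], [-1, 1, -1], [-1, -1, -1],
    [1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1],
]
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]

if __name__ == "__main__":
    print(minimize_features(X_demo, y_demo, alpha=0.05))
```

On this toy data the procedure retains a single column. Note that the retained column alone still determines each record's label, which is a small-scale reminder of the abstract's point that collecting less data is not the same as leaking less.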

Paper Structure (37 sections, 11 equations, 2 figures)

Introduction
Personalization in data minimization.
Data minimization and privacy.
Contributions.
Threat Models
Attacker Access
Attacker Objectives and Associated Risks
Re-identification Attacks
Reconstruction Attacks
Membership Inference Attacks
Privacy Metrics
Re-identification Risk (RIR)
Reconstruction Risk (RCR)
Membership Inference Risk (MIR)
Complexity of Data Minimization
...and 22 more sections

Figures (2)

Figure 1: An overview of our framework to study data minimization, and its place in a real-world ML pipeline. In the first half of the pipeline, we highlight the formalization of data minimization and quantify the risks of a data breach. In the second half, i.e., under the assumption of secure data transmission to train the learning model, we establish the objectives of data minimization through utility measurement and further study potential privacy leakage through model breach/release. Our framework offers a comprehensive perspective on the integration of data minimization into responsible data collection and management.
Figure 2: Caption.
