SoK: Data Minimization in Machine Learning

Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski

TL;DR

This SoK offers a structured overview of data minimization (DM) in machine learning (ML). It is designed to help practitioners and researchers adopt and apply DM principles by identifying relevant techniques and understanding their underlying assumptions and trade-offs through a DM-centric lens.

Abstract

Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, we present the first systematization of knowledge (SoK) for DMML. We introduce a general framework for DMML, encompassing a unified data pipeline, adversarial models, and points of minimization. This framework allows us to systematically review data minimization literature as well as DM-adjacent methodologies whose link to DM was often overlooked. Our structured overview is designed to help practitioners and researchers effectively adopt and apply DM principles in ML, by helping them identify relevant techniques and understand underlying assumptions and trade-offs through a DM-centric lens.

Paper Structure

This paper contains 49 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of our work. The Data Minimization (DM) principle is referenced by regulations such as GDPR (top left, \ref{sec:regulations}). At the same time, DM as the act of minimizing data is present in some form as part of various DM-adjacent techniques, without this connection being made explicit, which leads to confusion (bottom left). We unify these under a joint DMML framework (\ref{sec:framework}). In particular, a well-defined DMML workflow (middle, detailed in \ref{fig:pipeline}) and a list of relevant dimensions of DM techniques (right) allow us to systematically analyze all relevant techniques (\ref{sec:techniques}).
  • Figure 2: Illustration of DMML actors (\ref{ssec:framework:actors}), pipeline (\ref{ssec:framework:pipeline}), and adversaries (\ref{ssec:framework:quantifying}). During training and inference, data is provided by clients, who may transform it before sending it to the collector. The collector can further transform the data and may store it for later use. Finally, data is sent to a server to train the final model (another transformation). Between any two transformations, an adversary could intercept the data, threatening clients' privacy.
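
The pipeline described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the paper's formalization: the `Pipeline` class, the `drop_fields` helper, and the record fields (`name`, `zip_code`, `age`, `label`) are hypothetical names chosen for illustration, and field removal stands in for whatever transformation a real system would apply. The sketch only shows that each stage transforms the data and that an adversary intercepting after a given stage sees whatever has not yet been minimized at that point.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical record type: a flat mapping from field name to value.
Record = Dict[str, object]
Transform = Callable[[Record], Record]


def drop_fields(*names: str) -> Transform:
    """Return a transform that removes the named fields from a record."""
    def apply(record: Record) -> Record:
        return {k: v for k, v in record.items() if k not in names}
    return apply


@dataclass
class Pipeline:
    # Ordered stages: (stage name, transformation applied at that stage).
    stages: List[Tuple[str, Transform]]
    # What an adversary intercepting right after each stage would observe.
    observed: Dict[str, Record] = field(default_factory=dict)

    def run(self, record: Record) -> Record:
        for name, transform in self.stages:
            record = transform(record)
            self.observed[name] = dict(record)  # snapshot at this point
        return record


# Assumed example: the client drops the name, the collector drops the zip code.
pipeline = Pipeline(stages=[
    ("client", drop_fields("name")),
    ("collector", drop_fields("zip_code")),
])
raw = {"name": "Alice", "zip_code": "8001", "age": 34, "label": 1}
training_record = pipeline.run(raw)
```

In this toy setup, an adversary between client and collector still observes `zip_code`, while one between collector and server does not, which is the sense in which the point of minimization determines what each adversary can learn.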