I Prefer not to Say: Protecting User Consent in Models with Optional Personal Data
Tobias Leemann, Martin Pawelczyk, Christian Thomas Eberle, Gjergji Kasneci
TL;DR
The paper addresses privacy concerns in ML systems where users can choose whether to share optional personal data, formalizing two notions: Availability Inference Restriction (AIR) and Protected User Consent (PUC). It presents PUCIDA, a model-agnostic data augmentation approach for learning PUC-compliant predictors that minimize the loss $\mathcal{L}$ under AIR, and proves both predictive non-degradation and finite-sample convergence guarantees. The work provides a formal, scalable framework for balancing privacy, accuracy, and regulatory compliance, including a generalization to multiple optional features (r-dimensional PUC) and an analysis of strategic withholding. Empirically, PUCIDA removes the disadvantage faced by non-sharers while still letting the decision-maker use consenting users' data, with only moderate performance trade-offs, as demonstrated on eight real datasets and synthetic benchmarks.
Abstract
We examine machine learning models in a setup where individuals have the choice to share optional personal information with a decision-making system, as seen in modern insurance pricing models. Some users consent to their data being used whereas others object and keep their data undisclosed. In this work, we show that the decision not to share data can be considered as information in itself that should be protected to respect users' privacy. This observation raises the overlooked problem of how to ensure that users who protect their personal data do not suffer any disadvantages as a result. To address this problem, we formalize protection requirements for models which only use the information for which active user consent was obtained. This excludes implicit information contained in the decision to share data or not. We offer the first solution to this problem by proposing the notion of Protected User Consent (PUC), which we prove to be loss-optimal under our protection requirement. We observe that privacy and performance are not fundamentally at odds with each other and that it is possible for a decision maker to benefit from additional data while respecting users' consent. To learn PUC-compliant models, we devise a model-agnostic data augmentation strategy with finite-sample convergence guarantees. Finally, we analyze the implications of PUC on challenging real datasets, tasks, and models.
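To make the augmentation idea concrete, here is a minimal sketch under stated assumptions; it is not the paper's PUCIDA algorithm, whose details are not given in this summary. It assumes a single optional feature with withheld values encoded as NaN, and the helper name `augment_with_masked_copies` is hypothetical. Every record with a shared value is duplicated with that value masked, so the records a model sees with a missing value cover the whole population rather than only the non-sharers.

```python
import numpy as np

def augment_with_masked_copies(x_base, x_opt, y):
    """Hypothetical helper: append a masked copy of every sharer's record.

    x_base: base features (always available), shape (n,) or (n, d)
    x_opt:  optional feature, NaN where the user withheld it, shape (n,)
    y:      labels, shape (n,)
    """
    x_base = np.asarray(x_base, dtype=float)
    x_opt = np.asarray(x_opt, dtype=float)
    y = np.asarray(y, dtype=float)

    shared = ~np.isnan(x_opt)  # users who disclosed the optional feature

    # Original records plus one copy per sharer with the optional value hidden.
    x_base_aug = np.concatenate([x_base, x_base[shared]])
    x_opt_aug = np.concatenate([x_opt, np.full(int(shared.sum()), np.nan)])
    y_aug = np.concatenate([y, y[shared]])
    return x_base_aug, x_opt_aug, y_aug
```

In a toy dataset where non-sharers happen to have higher labels than sharers, the label mean among masked records moves from the non-sharers' mean to the population mean after augmentation. A model trained on the augmented data therefore has no incentive to treat a missing value as a signal in itself, while the unmasked records of sharers remain available for prediction.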
