Inducing Uncertainty on Open-Weight Models for Test-Time Privacy in Image Recognition
Muhammad H. Ashiq, Peter Triantafillou, Hung Yun Tseng, Grigoris G. Chrysos
TL;DR
This work introduces test-time privacy for open-weight image classifiers by defending against adversaries who leverage confident predictions on corrupted data. It formulates a Pareto-optimal fine-tuning approach that makes forget-set predictions uniform while preserving performance on retained data, enabled by a uniform learner and two algorithms: the Exact Pareto Learner and a certified variant. The authors derive a tight theoretical privacy-utility bound and provide a certified, Hessian-based framework for practical deployment, including an unbiased Hessian estimator and online extensions. Empirically, the method yields over 3× reductions in forget-set confidence with minimal drops in retain/test accuracy across multiple datasets and architectures, outperforming several baselines and offering reproducible results.
Abstract
A key concern for AI safety remains understudied in the machine learning (ML) literature: how can we ensure that users of ML models do not leverage predictions on incorrect personal data to harm others? This is particularly pertinent given the rise of open-weight models, where simply masking model outputs does not suffice to prevent adversaries from recovering harmful predictions. To address this threat, which we call *test-time privacy*, we induce maximal uncertainty on protected instances while preserving accuracy on all other instances. Our proposed algorithm uses a Pareto-optimal objective that explicitly balances test-time privacy against utility. We also provide a certifiable approximation algorithm that achieves $(\varepsilon, \delta)$ guarantees without convexity assumptions. We then prove a tight bound that characterizes the privacy-utility tradeoff that our algorithms incur. Empirically, our method obtains more than $3\times$ stronger uncertainty than pretraining with marginal drops in accuracy on various image recognition benchmarks. Altogether, this framework provides a tool to guarantee additional protection to end users.
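To make the idea of balancing uncertainty against utility concrete, the following is a minimal sketch of one possible scalarized objective, not the paper's exact formulation. It assumes that "maximal uncertainty" is measured as the KL divergence from the softmax output to the uniform distribution on protected (forget) instances, that utility is measured by cross-entropy on the remaining (retain) instances, and that the two are combined with a weight `theta` in [0, 1]. The function names and the `theta` parameter are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_uniform(probs):
    """KL(p || u) with u uniform over len(probs) classes; zero iff p is uniform."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label (clamped to avoid log(0))."""
    return -math.log(max(probs[label], 1e-12))


def tradeoff_loss(forget_logits, retain_logits, retain_labels, theta):
    """Convex combination of an uncertainty term and a utility term.

    theta = 1 only pushes forget-set predictions toward uniform;
    theta = 0 only fits the retain set. Both the weighting scheme and
    the choice of KL-to-uniform are assumptions made for this sketch.
    """
    forget_term = sum(kl_to_uniform(softmax(z)) for z in forget_logits)
    forget_term /= len(forget_logits)
    retain_term = sum(
        cross_entropy(softmax(z), y) for z, y in zip(retain_logits, retain_labels)
    )
    retain_term /= len(retain_logits)
    return theta * forget_term + (1.0 - theta) * retain_term


# Example: one confident forget-set prediction, one retain-set prediction.
loss = tradeoff_loss(
    forget_logits=[[10.0, 0.0, 0.0, 0.0]],
    retain_logits=[[2.0, 0.0, 0.0, 0.0]],
    retain_labels=[0],
    theta=0.5,
)
```

Under these assumptions, sweeping `theta` from 0 to 1 traces different tradeoffs between the two terms; a fine-tuning procedure would minimize this loss with respect to the model weights that produce the logits.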
