Querying Kernel Methods Suffices for Reconstructing their Training Data
Daniel Barzilai, Yuval Margalit, Eitan Gronich, Gilad Yehudai, Meirav Galun, Ronen Basri
TL;DR
The paper investigates privacy risks in kernel methods under query-only access, showing that an attacker can reconstruct training data for kernel regression, SVM, and KDE. It formalizes a reconstruction loss that uses only model outputs and proves that, for strictly positive-definite and almost-analytic kernels, $m > n(d+2)$ queries suffice to recover the training set with probability 1, where $m$ is the number of queries, $n$ the number of training points, and $d$ the input dimension. Empirically, reconstructions on CIFAR-10 and CelebA are high quality across multiple kernels, showing that hiding model parameters is not a sufficient defense in black-box settings. The work underscores privacy concerns in kernel-based learning and motivates the development of robust privacy-preserving techniques even when model parameters are not exposed.
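To make the query-only setting concrete, the following is a minimal sketch of what an output-matching reconstruction loss could look like for kernel regression. It is an illustration under assumptions, not the authors' implementation: the Gaussian (RBF) kernel, the bandwidth `gamma`, the small `ridge` term, the mean-squared form of the loss, and all function names are hypothetical choices made here. The attacker sees only the model's outputs at the query points `Z` and scores a candidate training set by how well it reproduces those outputs.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Gaussian kernel matrix; gamma is an assumed bandwidth, not from the paper.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_regression(X, y, ridge=1e-8):
    # Solve (K + ridge * I) alpha = y; the tiny ridge is for numerical stability only.
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + ridge * np.eye(len(X)), y)

def predict(X_train, alpha, Z):
    # Model output f(z) = sum_i alpha_i k(z, x_i) at each query point z.
    return rbf_kernel(Z, X_train) @ alpha

def reconstruction_loss(X_hat, alpha_hat, Z, outputs):
    # Assumed loss form: mean squared gap between the candidate model's
    # outputs and the observed outputs at the query points.
    residual = predict(X_hat, alpha_hat, Z) - outputs
    return float(np.mean(residual**2))

rng = np.random.default_rng(0)
n, d = 5, 3                      # training points and input dimension
m = n * (d + 2) + 1              # number of queries, chosen so that m > n(d+2)

X = rng.normal(size=(n, d))      # private training inputs
y = rng.normal(size=n)           # private training targets
alpha = fit_kernel_regression(X, y)

Z = rng.normal(size=(m, d))      # attacker-chosen query points
outputs = predict(X, alpha, Z)   # the only information the attacker observes

loss_true = reconstruction_loss(X, alpha, Z, outputs)
loss_random = reconstruction_loss(
    rng.normal(size=(n, d)), rng.normal(size=n), Z, outputs
)
print(f"loss at true training set: {loss_true:.3e}")
print(f"loss at random candidate:  {loss_random:.3e}")
```

The true training set attains zero loss by construction, while a random candidate does not. An actual attack would minimize this loss over `X_hat` and `alpha_hat`, for example by gradient descent; that optimization step is omitted here.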
Abstract
Over-parameterized models have raised concerns about their potential to memorize training data, even when achieving strong generalization. The privacy implications of such memorization are generally unclear, particularly in scenarios where only model outputs are accessible. We study this question in the context of kernel methods, and demonstrate both empirically and theoretically that querying kernel models at various points suffices to reconstruct their training data, even without access to model parameters. Our results hold for a range of kernel methods, including kernel regression, support vector machines, and kernel density estimation. Our hope is that this work can illuminate potential privacy concerns for such models.
