Defense Against Model Stealing Based on Account-Aware Distribution Discrepancy
Jian-Ping Mei, Weibin Zhang, Jie Chen, Xuyun Zhang, Tiantian Zhu
TL;DR
This work tackles model stealing from black-box image-classification services by introducing Account-aware Distribution Discrepancy (ADD), a non-parametric detector that exploits account-level dependencies among queries in embedding space. ADD models each class as a Multivariate Normal distribution and uses the squared Fréchet distance to quantify the discrepancy between reference statistics and account-specific query statistics. The resulting Malicious Score drives a plug-and-play defense (D-ADD) that applies random prediction poisoning. The approach preserves utility for benign users under both soft- and hard-label outputs and defends strongly against diverse cloning attacks, including adaptive strategies, while remaining training-free and lightweight. Empirical results across multiple datasets show superior detection and robust protection with minimal loss of target-model utility, highlighting practical potential for deployment in commercial APIs and informing future work on integrated defense frameworks.
Abstract
Malicious users attempt to functionally replicate commercial models at low cost by training a clone model on query responses. Preventing such model-stealing attacks in a timely manner, while both achieving strong protection and maintaining utility, is challenging. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) that recognizes queries from malicious users by leveraging account-wise local dependency. We model each class as a Multivariate Normal distribution (MVN) in the feature space and compute the malicious score as the weighted sum of class-wise distribution discrepancies. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Extensive experiments show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users, in both soft-label and hard-label settings.
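The scoring idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions of ours: covariances are taken to be diagonal (so the squared Fréchet distance between two Gaussians reduces to a per-dimension closed form and no matrix square root is needed), each class weight is the fraction of the account's queries assigned to that class, and the names `diag_gaussian`, `squared_frechet_diag`, and `malicious_score` are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance


def diag_gaussian(features):
    """Fit a per-dimension mean and variance to a list of feature vectors."""
    dims = list(zip(*features))  # transpose: one tuple of values per dimension
    return [fmean(d) for d in dims], [pvariance(d) for d in dims]


def squared_frechet_diag(mean_a, var_a, mean_b, var_b):
    """Squared Frechet distance between two Gaussians with diagonal covariance.

    With diagonal covariances (our simplifying assumption), the trace term
    Tr(A + B - 2 (A B)^(1/2)) becomes sum((sqrt(a_i) - sqrt(b_i))^2).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term


def malicious_score(reference, account_queries):
    """Weighted sum of class-wise discrepancies for one account.

    reference:       {class_label: (mean, variance)} from trusted data
    account_queries: {class_label: [feature vectors queried by the account]}
    The weighting by query fraction is an assumption made for illustration.
    """
    total = sum(len(feats) for feats in account_queries.values())
    score = 0.0
    for label, feats in account_queries.items():
        mean, var = diag_gaussian(feats)
        ref_mean, ref_var = reference[label]
        weight = len(feats) / total
        score += weight * squared_frechet_diag(ref_mean, ref_var, mean, var)
    return score
```

Under these assumptions, an account whose per-class query features match the reference statistics receives a score near zero, while an account whose queries cluster far from the reference distribution receives a large score. A full-covariance variant would need a matrix square root in the trace term.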
