Beyond Labeling Oracles: What does it mean to steal ML models?

Avital Shafran; Ilia Shumailov; Murat A. Erdogdu; Nicolas Papernot

TL;DR

This work challenges the standard assumption that model extraction (ME) is cost-efficient by showing that an attacker's success is driven largely by access to in-distribution (IND) data rather than by the querying strategy. By formalizing the attacker as $(\mathcal{D}_{IND}, \mathcal{D}_{OOD}, \Pi)$ and decomposing ML costs into data collection, labeling, and training, the authors demonstrate that prior knowledge about the victim's distribution often dominates ME performance. They introduce an instrumentation framework that modulates the informativeness of out-of-distribution (OOD) responses using a hybrid victim model $\mathcal{V}_h$ with a safe fake component $\mathcal{V}_f$, controlled by a threshold $\tau$ and a temperature $T$, enabling controlled ablations of OOD leakage. Across vision and language benchmarks, experiments reveal that ME can be effective but is typically constrained by IND data availability; using OOD data to reduce data costs helps only when OOD responses reveal IND structure, and minimizing both data costs and queries at the same time is usually infeasible. The practical takeaway is a call to redefine adversarial goals for ME attacks, recognizing that a victim model often serves mainly as a labeling oracle and that robust defenses should focus on limiting IND leakage rather than relying on random data or purely query-only strategies.
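The hybrid victim idea can be illustrated with a minimal Python sketch. The summary above only says that $\mathcal{V}_h$ combines the real model with a fake component $\mathcal{V}_f$ and is controlled by $\tau$ and $T$; the specific routing rule used here (compare the maximum temperature-scaled softmax confidence against $\tau$) and the uniform-output fake model are assumptions made for illustration, not the authors' confirmed implementation. All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Logits = Callable[[Sequence[float]], List[float]]


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_victim(
    x: Sequence[float],
    victim_logits: Logits,
    fake_logits: Logits,
    tau: float,
    temperature: float,
) -> Tuple[List[float], bool]:
    """Answer with the real model on confident inputs, else with the fake one.

    ASSUMPTION: an input is treated as OOD when the maximum
    temperature-scaled softmax probability falls below tau.
    Returns (probabilities, routed_to_fake).
    """
    logits = victim_logits(x)
    confidence = max(softmax(logits, temperature))
    if confidence < tau:
        return softmax(fake_logits(x)), True
    return softmax(logits), False


# Toy stand-ins (hypothetical): the "real" model echoes its input as logits,
# and the fake model returns equal logits, i.e. a uniform distribution.
def toy_victim(x: Sequence[float]) -> List[float]:
    return list(x)


def toy_fake(x: Sequence[float]) -> List[float]:
    return [0.0 for _ in x]


confident_probs, confident_fake = hybrid_victim([5.0, 0.0], toy_victim, toy_fake, tau=0.9, temperature=1.0)
uncertain_probs, uncertain_fake = hybrid_victim([0.0, 0.0], toy_victim, toy_fake, tau=0.9, temperature=1.0)
```

In this sketch, raising `tau` routes more queries to the fake component, which is one way to picture how the informativeness of OOD responses could be dialed down in a controlled ablation.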

Abstract

Model extraction attacks are designed to steal trained models with only query access, as is often provided through APIs that ML-as-a-Service providers offer. Machine Learning (ML) models are expensive to train, in part because data is hard to obtain, and a primary incentive for model extraction is to acquire a model while incurring less cost than training from scratch. Literature on model extraction commonly claims or presumes that the attacker is able to save on both data acquisition and labeling costs. We thoroughly evaluate this assumption and find that the attacker often does not. This is because current attacks implicitly rely on the adversary being able to sample from the victim model's data distribution. We thoroughly research factors influencing the success of model extraction. We discover that prior knowledge of the attacker, i.e., access to in-distribution data, dominates other factors like the attack policy the adversary follows to choose which queries to make to the victim model API. Our findings urge the community to redefine the adversarial goals of ME attacks as current evaluation methods misinterpret the ME performance.
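The cost argument in the abstract can be made concrete with a small accounting sketch. It assumes the three-way split into data collection, labeling, and training mentioned in the TL;DR; the class name, function name, and every number below are hypothetical placeholders in arbitrary units, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCost:
    """Cost of obtaining a model, split into three components (arbitrary units)."""
    data_collection: float
    labeling: float
    training: float

    @property
    def total(self) -> float:
        return self.data_collection + self.labeling + self.training


def extraction_saving(from_scratch: PipelineCost, attack: PipelineCost) -> float:
    """Positive values mean the attack is cheaper than training from scratch."""
    return from_scratch.total - attack.total


# Hypothetical numbers: the attacker still has to collect in-distribution
# data, so only the labeling component shrinks (queries replace annotators).
from_scratch = PipelineCost(data_collection=100.0, labeling=50.0, training=20.0)
attack_with_ind_data = PipelineCost(data_collection=100.0, labeling=5.0, training=20.0)

saving = extraction_saving(from_scratch, attack_with_ind_data)
```

Under these made-up values the attacker saves only on labeling, which mirrors the abstract's point that the data acquisition cost does not go away when the attack depends on in-distribution samples.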

Paper Structure (32 sections, 1 equation, 20 figures, 1 table)

This paper contains 32 sections, 1 equation, 20 figures, 1 table.

Introduction
Related work
Background
Definitions
Primer on ML costing
Methodology
Sampling complexity intuition
On hardness of OOD detection
Out-of-Distribution instrumentation
Evaluation
Experimental Setting.
Does model extraction work?
Can ME be used with only a few queries?
Can ME be used to reduce data costs?
Can ME be used both to reduce the data costs and use only a few queries?
...and 17 more sections

Figures (20)

Figure 1: Consider a linear classifier for which the decision boundary is given by the line $y=\alpha x$. An attacker attempts to steal the model (i.e. find the corresponding $\alpha=1$ from the example above). The green region $x \sim [20, 80]$ is in-distribution behaviour that the attacker wants to replicate, the red region $x \sim [0, 20] \cup [80, 100]$ is out of distribution and is not important for the task. The left case requires a single parameter to be approximated, the middle needs 5, whereas the right requires 9.
Figure 2: A comparison between the baseline attacker, which only uses its prior knowledge, and an attacker that can augment its queries with additional queries sampled from other data distributions. We fix the query budget to be the size of the original training set for a fair comparison. Attackers with more prior knowledge do not benefit much by augmenting the query set.
Figure 3: Evaluation of the risk posed by an attacker with some prior knowledge of the true data distribution, using different labeling sources. Labels provided by the victim model, whether in the richer soft-label setting or in the more restrictive label-only setting, do not provide a benefit over the real ground-truth labels. This shows that the victim is essentially a labeling oracle.
Figure 4: The effect of controlling OOD informativeness with different values of $\tau$ against an attacker that utilizes additional queries. In all cases other than DFME, the attacker adds as many additional queries as there are samples in the training set; for DFME, it adds 20M queries. Compared to the original setting (real model), where the OOD region is unmodified, there is a clear decrease in attack accuracy.
Figure 5: Comparison between the convergence rate of an attacker that uses the victim's full probability vector output (soft labels), an attacker that utilizes a label-only access to the victim model, and an attacker that uses the real ground truth labels. In all cases the attacker has access to $30\%$ of the true training samples. The attacker does not "learn faster" by attacking the victim model, and only benefits from the victim model when it has little prior knowledge over the true data distribution.
...and 15 more figures
