Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing
Vilém Zouhar
TL;DR
This work investigates how subword vocabularies, particularly BPE, affect learning-based MT model stealing. It formalizes an MT stealing setup that contrasts black-box and gray-box access and uses BLEU to evaluate how vocabulary choices influence the student model. The findings indicate that the victim's BPE vocabulary has only a marginal impact on the stolen model's performance, while gray-box access enables efficient recovery of the victim's vocabulary with high overlap, which raises security considerations for knowledge distillation. Practical attacks can thus reconstruct vocabularies from outputs, and domain-aligned vocabularies matter more for efficiency than exact replication of the victim's vocabulary, with implications for defenses against model stealing and distillation. The quantities involved include the BPE efficiency $\frac{|B_i(D_j)|}{|B_j(D_j)|}$ and the vocabulary overlap $\frac{2|V\cap V'|}{|V|+|V'|}$.
Abstract
In learning-based functionality stealing, the attacker tries to build a local model based on the victim's outputs. The attacker has to make choices regarding the local model's architecture, optimization method and, specifically for NLP models, subword vocabulary, such as BPE. On the machine translation task, we explore (1) whether the choice of the vocabulary plays a role in model stealing scenarios and (2) whether it is possible to extract the victim's vocabulary. We find that the vocabulary itself does not have a large effect on the local model's performance. Given gray-box model access, it is possible to recover the victim's vocabulary by collecting its outputs (detokenized subwords on the output). The finding that vocabulary choice has only a minimal effect is important more broadly for black-box knowledge distillation.
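As a rough illustration of the gray-box vocabulary recovery and the overlap measure $\frac{2|V\cap V'|}{|V|+|V'|}$, the following minimal sketch gathers the distinct subwords seen in a handful of outputs and compares them with a reference vocabulary. It is not the paper's implementation: the function names, the toy data, and the assumption that outputs arrive as space-separated subwords with the `@@` continuation marker (the subword-nmt convention) are all hypothetical choices made for this example.

```python
def collect_vocabulary(segmented_outputs):
    """Return the set of distinct subwords seen in the given outputs.

    Assumption: each output is a string of space-separated subword units,
    as a gray-box victim might expose them.
    """
    vocab = set()
    for line in segmented_outputs:
        vocab.update(line.split())
    return vocab


def vocabulary_overlap(v, v_prime):
    """Compute 2|V ∩ V'| / (|V| + |V'|) for two vocabularies given as sets."""
    if not v and not v_prime:
        # Design choice for this sketch: two empty vocabularies count as identical.
        return 1.0
    return 2 * len(v & v_prime) / (len(v) + len(v_prime))


# Hypothetical toy data, not taken from the paper.
victim_vocab = {"the", "c@@", "at", "sat", "m@@", "on"}
observed_outputs = ["the c@@ at sat", "the m@@ at"]

recovered_vocab = collect_vocabulary(observed_outputs)
overlap = vocabulary_overlap(victim_vocab, recovered_vocab)
print(f"recovered {len(recovered_vocab)} subwords, overlap = {overlap:.3f}")
```

In this toy case the subword `on` never appears in the outputs, so the recovered set misses it and the overlap stays below 1; collecting more outputs would be expected to expose more of the vocabulary.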
