Blackbox Dataset Inference for LLM
Ruikai Zhou, Kang Yang, Xun Chen, Wendy Hui Wang, Guanhong Tao, Jun Xu
TL;DR
This work tackles dataset misuse in large language model (LLM) training by proposing a black-box dataset inference method that relies solely on text-based outputs. It introduces tainted samples and two sets of reference models (non-member references and member references fine-tuned on D) to identify whether a suspect model used a victim dataset for training, without requiring intermediate model outputs. The approach achieves near-perfect accuracy under non-IID conditions and strong performance under IID settings, is robust against common evasion strategies, and applies to text generation. The method is computationally efficient online and scales with richer reference-model diversity, offering a practical tool for copyright and privacy verification in deployed LLMs.
Abstract
Today, the training of large language models (LLMs) can involve personally identifiable information and copyrighted material, leading to dataset misuse. To mitigate this problem, this paper explores \textit{dataset inference}, which aims to detect whether a suspect model $\mathcal{M}$ used a victim dataset $\mathcal{D}$ in training. Previous research tackles dataset inference by aggregating the results of membership inference attacks (MIAs), methods that determine whether individual samples are part of the training dataset. However, because MIAs have low accuracy, previous research requires grey-box access to $\mathcal{M}$ to obtain intermediate outputs (probabilities, loss, perplexity, etc.) and reach satisfactory results. This reduces practicality, as providers of LLMs, especially those deployed for profit, have limited incentive to return intermediate outputs. In this paper, we propose a new method of dataset inference that needs only black-box access to the target model (i.e., only the text-based responses of the target model are assumed to be available). Our method is enabled by two sets of locally built reference models, one set trained with $\mathcal{D}$ and the other without. By measuring which set of reference models $\mathcal{M}$ is closer to, we determine whether $\mathcal{M}$ used $\mathcal{D}$ for training. Evaluations on real-world LLMs in the wild show that our method offers high accuracy in all settings and is robust against bypassing attempts.
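
The decision rule described in the abstract (compare the suspect model's text responses with those of two reference sets and pick the closer one) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the prompts, or the aggregation, so everything below is an assumption made for illustration: `jaccard` is a stand-in token-overlap similarity, `mean_similarity` and `infer_dataset_use` are hypothetical helper names, and the toy strings are invented placeholders rather than real model outputs.

```python
from __future__ import annotations

from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two responses (assumed stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_similarity(suspect: list[str], references: list[list[str]]) -> float:
    """Average similarity between the suspect's responses and every
    reference model's responses to the same prompts (same order)."""
    return mean(
        jaccard(s, r)
        for ref in references
        for s, r in zip(suspect, ref)
    )


def infer_dataset_use(
    suspect: list[str],
    member_refs: list[list[str]],
    non_member_refs: list[list[str]],
) -> bool:
    """Return True if the suspect is closer to the member references,
    i.e. the references whose training involved the victim dataset."""
    return mean_similarity(suspect, member_refs) > mean_similarity(
        suspect, non_member_refs
    )


# Toy, invented responses to two prompts (not real model outputs).
suspect_out = ["the quick brown fox", "jumps over the lazy dog"]
member_out = [["the quick brown fox", "jumps over a lazy dog"]]
non_member_out = [["a fast red wolf", "leaps across the sleepy cat"]]

print(infer_dataset_use(suspect_out, member_out, non_member_out))
```

Only text responses enter the comparison, which is what makes the setting black-box: no probabilities, loss, or perplexity from the suspect model are needed. A real implementation would replace the token-overlap stand-in with whatever similarity and statistical test the method actually uses.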
