Nob-MIAs: Non-biased Membership Inference Attacks Assessment on Large Language Models with Ex-Post Dataset Construction
Cédric Eichler, Nathan Champeil, Nicolas Anciaux, Alexandra Bensamoun, Heber Hwang Arcolezi, José Maria De Fuentes
TL;DR
The paper tackles the challenge of evaluating Membership Inference Attacks on large language models when training data is only partially inferable, a situation common in copyright and privacy contexts. It introduces Nob-MIAs, a set of algorithms to construct ex-post datasets that are non-biased (No-Ngram) and non-classifiable (No-Class) to enable fair MIA assessment. Experimental validation on Gutenberg PG-19 with OpenLLaMA and Pythia shows that neutralizing known biases reduces apparent MIA effectiveness, with the best-performing attack dropping in TPR@10%FPR by about 40% and AUC by about 14.3% when moving from random to Nob datasets. The work provides a more robust framework for assessing copyright leakage and privacy risks in LLM pretraining, offering directions for extending the approach to non-text data and refining residual-bias detection.
Abstract
The rise of Large Language Models (LLMs) has triggered legal and ethical concerns, especially regarding the unauthorized use of copyrighted materials in their training datasets. This has led to lawsuits against tech companies accused of using protected content without permission. Membership Inference Attacks (MIAs) aim to detect whether specific documents were used in the pretraining of a given LLM, but their effectiveness is undermined by biases such as time-shifts and n-gram overlaps. This paper addresses the evaluation of MIAs on LLMs with partially inferable training sets, under the ex-post hypothesis, which acknowledges inherent distributional biases between member and non-member datasets. We propose and validate algorithms to create "non-biased" and "non-classifiable" datasets for fairer MIA assessment. Experiments using the Gutenberg dataset on OpenLLaMA and Pythia show that neutralizing known biases alone is insufficient. Our methods produce non-biased ex-post datasets with AUC-ROC scores comparable to those previously obtained on genuinely random datasets, validating our approach. Overall, MIAs yield results close to random; only one is effective on both random and our datasets, and its performance decreases when bias is removed.
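The two metrics used to summarize attack performance here, AUC-ROC and TPR at 10% FPR, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (`roc_points`, `auc_roc`, `tpr_at_fpr`), the toy scores, and the convention that a higher score means "more likely a member" are all assumptions made for illustration.

```python
import numpy as np


def roc_points(scores, labels):
    """Sweep a decision threshold over the scores and return (fpr, tpr) arrays.

    Assumed convention: labels are 1 for members and 0 for non-members,
    and a higher score means the attack considers the document more
    likely to be a member.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="mergesort")  # descending, stable
    scores, labels = scores[order], labels[order]
    true_pos = np.cumsum(labels)
    false_pos = np.cumsum(1 - labels)
    # Keep one ROC point per distinct score so ties are handled together.
    last = np.r_[np.nonzero(np.diff(scores))[0], len(scores) - 1]
    tpr = np.r_[0.0, true_pos[last] / labels.sum()]
    fpr = np.r_[0.0, false_pos[last] / (1 - labels).sum()]
    return fpr, tpr


def auc_roc(scores, labels):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    fpr, tpr = roc_points(scores, labels)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def tpr_at_fpr(scores, labels, max_fpr=0.10):
    """Highest TPR reachable while keeping the FPR at or below max_fpr."""
    fpr, tpr = roc_points(scores, labels)
    return float(tpr[fpr <= max_fpr].max())


# Toy attack scores (hypothetical values, not results from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print("AUC-ROC:", auc_roc(toy_scores, toy_labels))
print("TPR@10%FPR:", tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.10))
```

An attack that scores members and non-members identically yields an AUC-ROC of 0.5, which is the "close to random" baseline against which the reported results are read. TPR at a low fixed FPR complements AUC because it measures how many members the attack identifies when few false accusations are tolerated.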
