MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector

Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang

TL;DR

The paper tackles privacy and copyright risks from LLM pre-training data by reframing detection as a membership inference attack and introducing MIA-Tuner, which instructs LLMs to detect their own training data via a dedicated soft prompt. It provides an up-to-date benchmark, WIKIMIA-24, and two defense pipelines to counter both existing detectors and the proposed method. Empirical results show MIA-Tuner achieves state-of-the-art detection with an average AUC around 0.97 across aligned and unaligned models and demonstrates few-shot viability. The work also proposes defense mechanisms that substantially reduce detection success with minimal impact on model utility, highlighting practical implications for publishing and fine-tuning LLMs in privacy-sensitive settings.

Abstract

The growing parameter counts and expansive training datasets of large language models (LLMs) highlight the urgent need for technical solutions to audit the privacy risks and copyright issues associated with LLMs. Existing studies have partially addressed this need by exploring the pre-training data detection problem, an instance of a membership inference attack (MIA): determining whether a given piece of text was used during the pre-training phase of the target LLM. Although existing methods have designed various sophisticated MIA score functions that achieve considerable detection performance on pre-trained LLMs, achieving high-confidence detection and performing MIA on aligned LLMs remain challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA method that instructs LLMs themselves to serve as more precise internal pre-training data detectors, rather than designing an external MIA score function. Furthermore, we design two instruction-based safeguards to mitigate the privacy risks posed by existing methods and by MIA-Tuner, respectively. To comprehensively evaluate the most recent state-of-the-art LLMs, we collect a more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely adopted WIKIMIA benchmark. We conduct extensive experiments across various aligned and unaligned LLMs on both benchmark datasets. The results demonstrate that MIA-Tuner increases the AUC of MIAs from 0.7 to 0.9.
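
The abstract contrasts MIA-Tuner with existing methods that compute an external MIA score function and are compared by AUC. As a minimal sketch of that baseline setting (not of MIA-Tuner itself, which instead uses a dedicated soft prompt so the LLM acts as the detector), the Python below computes two common score styles, an average-loss score and a Min-K%-style score that averages the lowest token log-probabilities, plus the AUC used to compare detectors. The function names, the toy log-probabilities, and the choice of k are illustrative assumptions; in practice the per-token log-probabilities would come from the target LLM.

```python
import math
from typing import Sequence

# Hypothetical inputs: per-token log-probabilities log p(x_i | x_<i) that the
# target LLM assigns to each candidate text (assumed precomputed elsewhere).


def avg_loss(token_logprobs: Sequence[float]) -> float:
    """Negative mean token log-likelihood (the LM loss) of one text."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(average loss)."""
    return math.exp(avg_loss(token_logprobs))


def min_k_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean of the lowest k fraction of token log-probs (at least one token).
    Higher values suggest the text was seen during training."""
    n = max(1, int(len(token_logprobs) * k))
    return sum(sorted(token_logprobs)[:n]) / n


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random member (label 1) scores higher
    than a random non-member (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both members and non-members")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy, made-up log-probs: two "member" texts and two "non-member" texts.
    texts = [
        [-0.5, -1.0, -0.2, -3.0],
        [-0.1, -0.4, -2.0, -0.3],
        [-2.5, -3.1, -1.8, -4.0],
        [-1.2, -5.0, -2.2, -0.9],
    ]
    labels = [1, 1, 0, 0]
    # Low loss means "likely member", so negate it to get a higher-is-member score.
    loss_scores = [-avg_loss(t) for t in texts]
    mink_scores = [min_k_score(t, k=0.25) for t in texts]
    print("loss-based AUC:", roc_auc(loss_scores, labels))
    print("Min-K-style AUC:", roc_auc(mink_scores, labels))
```

The pairwise AUC is quadratic in the number of texts, which keeps the sketch transparent; for large benchmarks a sort-based computation or a library routine would be preferable.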

Paper Structure

This paper contains 29 sections, 12 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: The overall framework of MIA-Tuner and the two pipelines designed for aligned and unaligned LLMs, respectively.
  • Figure 2: The performance of MIA-Tuner on LLaMA-2 while utilizing different numbers of fine-tuning samples.
  • Figure 3: The detection performance of all baselines on LLaMA-2 w/ and w/o the proposed safeguard.
  • Figure 4: The accuracy of LLaMA-2 on the MMLU benchmark w/ and w/o the proposed safeguard across four different types of tasks.
  • Figure 5: (a) The fine-tuning (FT) perplexity (PPL) of the benign user and (b) the detection AUC of the malicious user across aligned and unaligned LLMs in three stages: Before FT, After FT (w/o Defender), and After FT (w/ Defender).