In-Context Probing for Membership Inference in Fine-Tuned Language Models

Zhexi Lu, Hongliang Chi, Nathalie Baracaldo, Swanand Ravindra Kadhe, Yuseok Jeon, Lei Yu

TL;DR

The paper tackles privacy risks from membership inference in fine-tuned LLMs by grounding attacks in training dynamics. It introduces the Optimization Gap and In-Context Probing (ICP) to estimate residual learning potential without retraining, enabling a practical black-box MIA. ICP-MIA combines reference-based and self-perturbation probing to achieve state-of-the-art or competitive performance, especially under low false-positive constraints, across diverse LLMs and tasks. It also analyzes how model type, PEFT configurations, data alignment, and differential privacy influence attack effectiveness, offering actionable guidance for privacy auditing and defense design.

Abstract

Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties - such as content difficulty or rarity - leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
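
The decision rule implied by the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `icp_gain`, `predict_member`, and `toy_loglik`, the averaging over probes, and the fixed threshold are all assumptions made for this example. The black-box model is represented by a hypothetical callable that returns the log-likelihood of a text given a context.

```python
from typing import Callable, Sequence

# Hypothetical black-box interface (an assumption for this sketch):
# returns the log-likelihood of `text` conditioned on `context`,
# where an empty context means the text is scored on its own.
LogLikelihoodFn = Callable[[str, str], float]


def icp_gain(loglik: LogLikelihoodFn, target: str, probes: Sequence[str]) -> float:
    """Mean log-likelihood improvement of `target` when a probe context is prepended."""
    if not probes:
        raise ValueError("at least one probe context is required")
    base = loglik("", target)
    gains = [loglik(probe, target) - base for probe in probes]
    return sum(gains) / len(gains)


def predict_member(gain: float, threshold: float) -> bool:
    """A small gain suggests little residual learning potential, i.e. a likely member."""
    return gain < threshold


def toy_loglik(context: str, text: str) -> float:
    """Toy stand-in for a fine-tuned model, used only to exercise the functions above."""
    base = {"seen": -1.0, "unseen": -5.0}[text]
    bonus = {"seen": 0.1, "unseen": 2.0}[text] if context else 0.0
    return base + bonus


if __name__ == "__main__":
    probes = ["reference sample A", "reference sample B"]
    for sample in ("seen", "unseen"):
        gain = icp_gain(toy_loglik, sample, probes)
        print(sample, round(gain, 3), predict_member(gain, threshold=1.0))
```

In practice the probe contexts would come from the two strategies named in the abstract (reference data or self-perturbation), and the threshold would be calibrated for a target false positive rate. Neither step is shown here.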

Paper Structure

This paper contains 63 sections, 10 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Log-likelihood improvement distribution on the HealthcareMagic dataset. Member samples (blue) show minimal gains from in-context probing, while non-members (red) exhibit larger, more variable improvements, revealing the optimization gap that underlies our attack.
  • Figure 2: Empirical illustration of diminishing returns during LLM fine-tuning.
  • Figure 3: Fine-tuning with Members vs. Non-Members
  • Figure 4: Correlation between actual single-step training loss reduction and the ICP-induced loss reduction.
  • Figure 5: Impact of reference data (prefix pool) and model type on the fidelity of the ICP proxy, measured by Spearman correlation. The proxy demonstrates the highest effectiveness (strongest correlation) with instruction-tuned models and when the reference data closely aligns with the target task’s domain and semantics.
  • ...and 8 more figures