Fragments to Facts: Partial-Information Fragment Inference from LLMs

Lucas Rosenblatt, Bin Han, Robert Wolfe, Bill Howe

TL;DR

The paper introduces partial-information fragment inference (PIFI), a pragmatic privacy threat where an adversary with access to unordered text fragments can infer additional private fragments from a fine-tuned LLM. It proposes two data-blind attacks, LR-Attack and PRISM, and a data-aware Classifier baseline, demonstrating their effectiveness in medical and legal summarization tasks across multiple models and training regimes. The results show that fragment-level leakage persists even with modest fine-tuning, larger models are more vulnerable, and world-model priors can help mitigate false positives for common fragments. The work highlights the need for defenses beyond memorization and full-data membership protections, urging further exploration of privacy-preserving fine-tuning and defense strategies in sensitive-domain AI systems.

Abstract

Large language models (LLMs) can leak sensitive training data through memorization and membership inference attacks. Prior work has primarily focused on strong adversarial assumptions, including attacker access to entire samples or long, ordered prefixes, leaving open the question of how vulnerable LLMs are when adversaries have only partial, unordered sample information. For example, if an attacker knows a patient has "hypertension," under what conditions can they query a model fine-tuned on patient data to learn the patient also has "osteoarthritis?" In this paper, we introduce a more general threat model under this weaker assumption and show that fine-tuned LLMs are susceptible to these fragment-specific extraction attacks. To systematically investigate these attacks, we propose two data-blind methods: (1) a likelihood ratio attack inspired by methods from membership inference, and (2) a novel approach, PRISM, which regularizes the ratio by leveraging an external prior. Using examples from both medical and legal settings, we show that both methods are competitive with a data-aware baseline classifier that assumes access to labeled in-distribution data, underscoring their robustness.
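The likelihood-ratio idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's exact formulation: it assumes the adversary can obtain, for each candidate fragment, a log-probability under the fine-tuned (target) model and under some reference model, both conditioned on the known fragments. The function names, the threshold, and the toy log-probabilities are all hypothetical. PRISM's prior-based regularization is not sketched here because its form is not given in this excerpt.

```python
from typing import Dict, List


def lr_scores(target_logp: Dict[str, float],
              reference_logp: Dict[str, float]) -> Dict[str, float]:
    """Log-likelihood ratio per candidate fragment.

    A large positive score means the target model assigns the fragment
    much more probability than the reference model does.
    """
    return {frag: target_logp[frag] - reference_logp[frag]
            for frag in target_logp}


def flag_fragments(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return candidate fragments whose score exceeds the threshold."""
    return sorted(frag for frag, score in scores.items() if score > threshold)


# Toy, made-up log-probabilities conditioned on the known fragment
# "hypertension" (hypothetical values, for illustration only).
target_logp = {"osteoarthritis": -2.0, "asthma": -6.0}
reference_logp = {"osteoarthritis": -5.0, "asthma": -5.5}

scores = lr_scores(target_logp, reference_logp)
flagged = flag_fragments(scores, threshold=1.0)
print(scores)
print(flagged)
```

In this toy setup only "osteoarthritis" is flagged, because the target model rates it far more likely than the reference model does. The threshold trades off true and false positives and would have to be calibrated in practice.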

Paper Structure

This paper contains 51 sections, 2 equations, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparing the PIFI LLM threat model to the memorization threat model in a medical scenario. PIFI uses unordered, publicly available fragments from a sample (like a patient record) to infer private fragments (like a sensitive medical condition). Memorization assumes access to an ordered prefix of the sample, and checks for verbatim generation by an LLM of the suffix.
  • Figure 2: An illustration of our threat model. An LLM is fine-tuned, e.g., with private medical notes. Then, an adversary prompts the fine-tuned LLM with relevant fragments of information (e.g., from a target individual's medical records) to infer unknown fragments associated with the individual.
  • Figure 3: Successful and failed attack scenarios. An attack is only considered successful when the sequence $\mathbf{s}$ is in the dataset $D$, the target fragment $\mathbf{y^*}$ is in the sequence $\mathbf{s}$, and we accurately infer the target fragment's presence in $\mathbf{s}$. Any other scenario is considered as a failed attack -- (1) the target fragment is NOT in the sequence. (2) the sequence is NOT in the data $D$. (3) the sequence is NOT in the data $D$ and the target fragment is NOT in the sequence.
  • Figure 4: Left: single epoch fine-tuned Llama-3 8B model. Right: convergence fine-tuned Llama-3 8B model. More fine-tuning consistently increases the success rates of PIFI attacks.
  • Figure 5: Left: fully fine-tuned Llama-3 8B model. Right: LoRA fine-tuned Llama-3 8B model. LoRA-fine-tuned models exhibit less vulnerability than their fully fine-tuned counterparts.
  • ...and 8 more figures
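
The success criterion described in the Figure 3 caption can be written as a small predicate. This is a sketch under the caption's wording only: the function name and the boolean arguments are hypothetical, and how the adversary's prediction is produced is outside its scope.

```python
def attack_outcome(seq_in_dataset: bool,
                   fragment_in_seq: bool,
                   predicted_present: bool) -> str:
    """Classify one attack attempt following the Figure 3 caption.

    Success requires all three conditions: the sequence s is in D,
    the target fragment y* is in s, and the adversary infers that
    y* is present. Every other combination counts as a failure.
    """
    if seq_in_dataset and fragment_in_seq and predicted_present:
        return "success"
    return "failure"


# The three failure scenarios listed in the caption, plus the success case.
print(attack_outcome(True, True, True))     # s in D, y* in s, inferred present
print(attack_outcome(True, False, True))    # (1) y* not in s
print(attack_outcome(False, True, True))    # (2) s not in D
print(attack_outcome(False, False, True))   # (3) s not in D and y* not in s
```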