Can We Infer Confidential Properties of Training Data from LLMs?
Pengrun Huang, Chhavi Yadav, Kamalika Chaudhuri, Ruihan Wu
TL;DR
This work investigates whether domain-specific fine-tuned LLMs leak dataset-level properties of their training data. It introduces PropInfer, a benchmark built on the ChatDoctor dataset that evaluates property inference under two fine-tuning modes: question-answering (SFT) and chat-completion (CLM-FT). It proposes two attacks tailored to LLMs, a prompt-based generation attack for black-box access and a shadow-model attack based on word frequency for grey-box access, and evaluates them across multiple models and target properties. The results indicate that property inference is a practical threat: the word-frequency shadow-model attack performs best in question-answering mode, while the generation-based attack performs best in chat-completion mode, and effectiveness varies with whether the target property appears in the questions, the answers, or both. The findings point to a concrete risk to dataset confidentiality in deployed models, and the benchmark provides a standardized framework for evaluating attacks and developing defenses.
Abstract
Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.
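To make the idea of a prompt-based generation attack concrete, the following is a minimal sketch of one plausible reading: sample many outputs from the fine-tuned model, label each output for the target property, and report the fraction of positives as the estimate of the dataset-level property. It is an illustration under stated assumptions, not the authors' implementation. The names `estimate_property_ratio`, `mentions_any`, and `make_toy_generator` are hypothetical, the keyword labeler is a stand-in for whatever property labeler an attacker would actually use, and the toy generator replaces a real black-box LLM call so that the example runs on its own.

```python
import random
from typing import Callable, Iterable


def estimate_property_ratio(
    generate: Callable[[str], str],
    prompt: str,
    has_property: Callable[[str], bool],
    num_samples: int = 200,
) -> float:
    """Estimate the fraction of generated samples that exhibit a property.

    `generate` is assumed to wrap black-box sampling from the target model.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    hits = sum(has_property(generate(prompt)) for _ in range(num_samples))
    return hits / num_samples


def mentions_any(keywords: Iterable[str]) -> Callable[[str], bool]:
    """Hypothetical labeler: flags a text if it contains any keyword."""
    lowered = [k.lower() for k in keywords]

    def labeler(text: str) -> bool:
        t = text.lower()
        return any(k in t for k in lowered)

    return labeler


def make_toy_generator(ratio: float, seed: int = 0) -> Callable[[str], str]:
    """Toy stand-in for a fine-tuned LLM (assumption, not a real model).

    Emits a text with the property with probability `ratio`.
    """
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        if rng.random() < ratio:
            return "I am a 34-year-old woman with a persistent cough."
        return "I am a 52-year-old man with a persistent cough."

    return generate


if __name__ == "__main__":
    generator = make_toy_generator(ratio=0.3, seed=0)
    labeler = mentions_any(["woman", "female"])
    estimate = estimate_property_ratio(
        generator, "Hi doctor,", labeler, num_samples=2000
    )
    print(f"estimated property ratio: {estimate:.3f}")
```

In a real setting, `generate` would call the deployed model with a short domain-relevant prompt and sampling enabled, and the quality of the estimate would depend on how faithfully the model's generations reflect its fine-tuning distribution and on the accuracy of the labeler.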
