CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense
Zhuochen Yang, Kar Wai Fok, Vrizlynn L. L. Thing
TL;DR
This paper addresses privacy leakage risks in large language models (LLMs) by analyzing soft-prompt-driven data extraction under white-box conditions. It introduces CoSPED, a framework that combines consistency-driven losses (Dynamic, Additive, Common) with a Self-Consistency Decoding strategy to stabilize targeted data extraction across model families such as GPT-Neo and Pythia. Empirical results show a high extraction rate (ER$_{50}$ of 65.2%) that exceeds prior methods such as Ethicist and CLM, while a defense based on Rank-One Model Editing (ROME) reduces the extraction rate to 1.6% with limited impact on general language abilities (measured by LAMBADA). The work thus highlights both the vulnerability of LLMs to soft-prompt attacks and a practical defense pathway, informing safer design and mitigation strategies for privacy-preserving AI deployment.
Abstract
Large language models (LLMs) have gained widespread attention recently, but their potential security vulnerabilities, especially privacy leakage, are also becoming apparent. To test and evaluate data extraction risks in LLMs, we propose CoSPED, short for Consistent Soft Prompt targeted data Extraction and Defense. We introduce several innovative components, namely Dynamic Loss, Additive Loss, Common Loss, and a Self-Consistency Decoding strategy, and test how they enhance the consistency of the soft prompt tuning process. Through extensive experimentation with various combinations, we achieve an extraction rate of 65.2% at a 50-token prefix comparison. Comparisons with other reference works confirm that CoSPED attains higher extraction rates. We further evaluate CoSPED in additional scenarios, achieving an extraction rate of 51.7% on the Pythia model and introducing a cross-model comparison. Finally, we explore defense through Rank-One Model Editing and reduce the extraction rate to 1.6%, which shows that our analysis of extraction mechanisms can directly inform effective mitigation strategies against soft prompt-based attacks.
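To make the reported metric concrete, the following is a minimal sketch of how an extraction rate at a k-token comparison could be computed. It assumes the metric is the fraction of samples whose first k generated tokens exactly match the first k tokens of the true continuation; the abstract does not spell out the definition, so this reading is an assumption. The function name `extraction_rate` and its parameters are hypothetical and not taken from the paper.

```python
from typing import Sequence


def extraction_rate(
    generated: Sequence[Sequence[int]],
    reference: Sequence[Sequence[int]],
    k: int = 50,
) -> float:
    """Fraction of samples whose first k generated token ids exactly
    match the first k reference token ids (assumed metric definition)."""
    if len(generated) != len(reference):
        raise ValueError("generated and reference must have the same length")
    if not generated:
        return 0.0
    hits = 0
    for gen, ref in zip(generated, reference):
        # A sample counts only if the reference has at least k tokens
        # and the generated tokens agree on all k positions.
        if len(ref) >= k and list(gen[:k]) == list(ref[:k]):
            hits += 1
    return hits / len(generated)


# Toy usage with 3-token sequences: one of the two samples matches exactly.
toy_generated = [[1, 2, 3], [4, 5, 6]]
toy_reference = [[1, 2, 3], [4, 5, 7]]
toy_rate = extraction_rate(toy_generated, toy_reference, k=3)
```

Under this reading, an extraction rate of 65.2% would mean that roughly 65 of every 100 targeted samples are reproduced exactly over the compared token window.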
