LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data

Eyal German, Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici

TL;DR

LexiMark is a text watermarking approach that embeds a watermark into training data by substituting high-entropy words with higher-entropy synonyms. The substitutions preserve semantic meaning while enhancing memorization, which makes membership verification via membership inference attacks (MIAs) more robust. The method achieves strong AUROC gains across diverse models and datasets, enables efficient dataset-level detection with as few as six records, and demonstrates resilience to common text modifications and post-training updates. Semantic preservation is balanced against detectability: higher similarity thresholds and contextual synonym selection improve meaning retention. The work offers a practical, language-agnostic tool for protecting proprietary data in LLM training, with open-source resources to support reproducibility and adoption.

Abstract

Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner's consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM's memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method's effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.
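The substitution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The unigram counts, the synonym table, and the function names `score` and `watermark` are all hypothetical, and surprisal under a toy unigram model stands in for whatever entropy estimate the paper actually uses.

```python
import math

# Hypothetical unigram counts; a real system would use a corpus or a language model.
TOY_COUNTS = {
    "the": 5000, "over": 900, "dog": 300, "brown": 150, "fox": 120,
    "quick": 60, "lazy": 50, "jumps": 40,
    "fast": 200, "speedy": 10, "hops": 30, "leaps": 8,
    "idle": 45, "sluggish": 5,
}
TOTAL = sum(TOY_COUNTS.values())

# Hypothetical synonym table; a real system would also check that candidates fit the context.
TOY_SYNONYMS = {
    "quick": ["fast", "speedy"],
    "jumps": ["hops", "leaps"],
    "lazy": ["idle", "sluggish"],
}


def score(word):
    """Surprisal in bits under the toy unigram model (an assumed proxy for word entropy)."""
    count = TOY_COUNTS.get(word.lower(), 1)
    return -math.log2(count / TOTAL)


def watermark(text, k):
    """Replace up to k of the highest-scoring words with a higher-scoring synonym."""
    words = text.split()
    # Only words that have an entry in the synonym table are candidates.
    candidates = [i for i, w in enumerate(words) if w.lower() in TOY_SYNONYMS]
    # Keep the k candidates with the highest score.
    chosen = sorted(candidates, key=lambda i: score(words[i]), reverse=True)[:k]
    for i in chosen:
        original = words[i]
        best = max(TOY_SYNONYMS[original.lower()], key=score)
        # Substitute only when the synonym scores higher than the original word.
        if score(best) > score(original):
            words[i] = best
    return " ".join(words)


print(watermark("The quick brown fox jumps over the lazy dog", k=3))
# prints: The speedy brown fox leaps over the sluggish dog
```

With `k=3` the toy tables reproduce the substitutions shown in Figure 1. Restricting candidates to words that have a listed synonym is a simplification of this sketch, not a claim about how the paper selects words.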

Paper Structure

This paper contains 26 sections, 2 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Our synonym replacement method with K=3 substitutions: “quick,” “jumps,” and “lazy” $\rightarrow$ “speedy,” “leaps,” and “sluggish.”
  • Figure 2: Flowchart illustrating the process of embedding watermarks in text through high-entropy word substitution.
  • Figure 3: AUROC scores obtained using different watermarking techniques on the BookMIA dataset with the LLaMA-1 7B model. Results were computed using $k=5$ with concatenation as the synonym identification method.
  • Figure 4: AUROC scores comparing various synonym identification methods for watermark detection on the BookMIA dataset, highlighting the method with the highest semantic preservation.
  • Figure 5: Semantic similarity evaluation on the BookMIA dataset using the GPT embedding model "text-embedding-3-large," showing the proportion of watermarked samples with cosine similarity above various thresholds.
  • ...and 3 more figures
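
The quantity plotted in Figure 5, the proportion of watermarked samples whose cosine similarity to the original exceeds a threshold, can be computed as in the sketch below. The embedding vectors here are hypothetical toy values, not outputs of "text-embedding-3-large", and the helper names are illustrative. The sketch also assumes that "above" includes the threshold itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proportion_above(similarities, thresholds):
    """Map each threshold to the fraction of similarities at or above it."""
    n = len(similarities)
    return {t: sum(s >= t for s in similarities) / n for t in thresholds}


# Hypothetical (original, watermarked) embedding pairs.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # identical direction
    ([1.0, 1.0, 0.0], [1.0, 0.9, 0.1]),  # small shift
    ([0.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # orthogonal vectors
]
similarities = [cosine_similarity(u, v) for u, v in pairs]
print(proportion_above(similarities, thresholds=[0.5, 0.9, 0.999]))
```

Sweeping the threshold list traces out a curve of the kind the caption describes: a stricter threshold can only keep the same number of samples or fewer.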