Signal Watermark on Large Language Models

Zhenyu Xu, Victor S. Sheng

TL;DR

This paper proposes a watermarking method that embeds a specific watermark into text as an LLM generates it, following a pre-defined signal pattern. The method not only keeps the watermark invisible to humans but also maintains the quality and grammatical integrity of the model-generated text.

Abstract

As Large Language Models (LLMs) become increasingly sophisticated, they raise significant security concerns, including the creation of fake news and academic misuse. Most detectors for identifying model-generated text are limited by their reliance on variance in perplexity and burstiness, and they require substantial computational resources. In this paper, we propose a watermarking method that embeds a specific watermark into text during its generation by LLMs, based on a pre-defined signal pattern. This technique not only ensures the watermark's invisibility to humans but also maintains the quality and grammatical integrity of model-generated text. We use LLMs and the Fast Fourier Transform (FFT) for token probability computation and detection of the signal watermark. Applying signal processing principles to text generation by LLMs allows for subtle yet effective embedding of watermarks that do not compromise the quality or coherence of the generated text. Our method has been empirically validated across multiple LLMs, consistently maintaining high detection accuracy, even with variations in temperature settings during text generation. In distinguishing between human-written and watermarked text, our method achieved an AUROC score of 0.97, significantly outperforming existing methods like GPTZero, which scored 0.64. The watermark's resilience to various attack scenarios further confirms its robustness, addressing significant challenges in model-generated text authentication.

Paper Structure

This paper contains 35 sections, 9 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Two samples, one without a watermark and one with an embedded watermark, both generated by the text-davinci-003 model. The Perplexity (PPL) scores, calculated by the same model, indicate that even with the watermark embedded, the text maintains regular text quality, similar to high-quality human-written text.
  • Figure 2: An example of embedding a signal watermark while an LLM generates text. The black box displays the pool of top 5 token candidates, and the temperature is set to 0. When the model generates text with no watermark, no specific pattern is followed and it consistently selects the token with the highest probability as the next token, as indicated in blue. When the model generates watermarked text from the same prompt, a pre-defined pattern sampled from sin(x) dictates the rank of the token to be selected from the candidate pool. Each next token is chosen according to this pattern, shown in orange. Higher temperatures would perturb this sampling process, leading the model to select tokens that deviate from the most probable one. For examples of text generation at different temperatures, please see Appendix A.
  • Figure 3: We perform Token Probability Re-computation on model-generated text and use the Fast Fourier Transform (FFT) for watermark detection. 'Real completion' refers to human-written text following the prompt, while 'No watermark' and 'Watermark' are texts generated by the model from the same prompt. In part (a), the pattern is a sinusoidal signal with a period of 10 and 10 samples per period. The upper blue waveforms for Real completion, No Watermark (NW), and Watermark (W) show the rank of the actual token within its candidate pool after the re-computation process. The lower green waveforms indicate the log probability of the current token. In the 'Watermark' plot, a periodic waveform similar to the pre-defined pattern can be observed. Part (b) presents the frequency spectra obtained by applying FFT to the blue waveforms in (a). In the 'FFT on W' spectrum, the peak frequency of the signal watermark is observed; it corresponds to the pre-defined pattern frequency in 'FFT on pattern', highlighted by a red dot at the position of 0.1. In contrast, the spectra from 'FFT on Real completion' and 'FFT on NW' do not show the same peak frequency. Calculating the frequency spectrum via FFT in this way enables detection of the watermark's presence.
  • Figure 4: Impact of Temperature on Watermark Detection Accuracy and Average PPL for two models: (a) OPT-1.3b (b) text-davinci-003. In scenarios with higher temperature settings, text-davinci-003 maintains a relatively high accuracy rate. In contrast, OPT-1.3b, while performing better at lower temperatures, exhibits a more significant drop in accuracy as the temperature increases. This difference highlights how each model responds differently to variations in temperature, which in turn affects watermark detection performance.
  • Figure 5: Impact of Signal Amplitude on Watermark Detection Accuracy and Average PPL for two models: (a) OPT-1.3b (b) text-davinci-003. For both models, increasing the Signal Amplitude results in a rise in perplexity, thereby reducing text quality. However, this adjustment concurrently enhances the accuracy of watermark detection. This phenomenon illustrates a trade-off between text quality and the effectiveness of watermark detection influenced by Signal Amplitude.
  • ...and 4 more figures
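
The rank pattern and spectral check described in the captions for Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`rank_pattern`, `magnitude_spectrum`, `peak_frequency`, `has_watermark`), the way sin(x) is quantized into ranks 0 to 4, and the tolerance `tol` are all hypothetical. A plain discrete Fourier transform from the standard library stands in for the FFT; it yields the same magnitudes, only more slowly. The sketch assumes the per-token ranks have already been recovered by re-computing token probabilities with the model.

```python
import cmath
import math


def rank_pattern(length, period=10, pool_size=5):
    """Quantize a sinusoid into token ranks 0..pool_size-1 (assumed mapping)."""
    ranks = []
    for n in range(length):
        s = math.sin(2 * math.pi * n / period)  # value in [-1, 1]
        ranks.append(round((s + 1) / 2 * (pool_size - 1)))
    return ranks


def magnitude_spectrum(ranks):
    """Return (frequency, magnitude) pairs for the mean-centered rank sequence.

    Uses a naive O(n^2) DFT; an FFT routine would give the same magnitudes.
    Frequencies are in cycles per token, from 0 up to 0.5.
    """
    n = len(ranks)
    mean = sum(ranks) / n
    centered = [r - mean for r in ranks]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append((k / n, abs(coeff)))
    return spectrum


def peak_frequency(ranks):
    """Return the (frequency, magnitude) pair of the strongest non-DC bin."""
    return max(magnitude_spectrum(ranks)[1:], key=lambda fm: fm[1])


def has_watermark(ranks, period=10, tol=0.01):
    """Flag a sequence whose spectral peak sits near 1/period (assumed rule)."""
    freq, mag = peak_frequency(ranks)
    return mag > 1e-9 and abs(freq - 1 / period) <= tol


if __name__ == "__main__":
    watermarked = rank_pattern(100)   # ranks that follow the pattern
    greedy = [0] * 100                # temperature 0, always the top token
    print(peak_frequency(watermarked)[0], has_watermark(watermarked))
    print(has_watermark(greedy))
```

With a period of 10 tokens, the peak for the pattern-following sequence falls at 0.1 cycles per token, matching the red dot position described for Figure 3, while a greedy sequence that always takes the top-ranked token has no such peak. A real detector would also need a threshold that tolerates the noise introduced by higher temperatures, which this sketch does not model.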