Stylometric Watermarks for Large Language Models

Georg Niess, Roman Kern

TL;DR

The paper tackles the rising challenge of distinguishing human- from machine-generated text and enabling accountability for proprietary LLMs. It introduces a novel watermarking approach that controls stylometric features by deriving a per-sentence semantic key to steer token probabilities during generation. Two features, acrostic cues and sensorimotor norms, are employed, with keys produced via semantic zero-shot classification and detected through statistical hypothesis testing. The results demonstrate robust detection for texts of three or more sentences and resilience to cyclic translation attacks, all without requiring extra fine-tuning or external detectors, suggesting practical utility for enforcing accountability in large language models.

Abstract

The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and machines. Addressing this, we propose a novel method for generating watermarks that strategically alters token probabilities during generation. Unlike previous works, this method uniquely employs linguistic features such as stylometry. Concretely, we introduce acrostica and sensorimotor norms to LLMs. Further, these features are parameterized by a key, which is updated every sentence. To compute this key, we use semantic zero-shot classification, which enhances resilience. In our evaluation, we find that for three or more sentences, our method achieves a false positive and false negative rate of 0.02. For the case of a cyclic translation attack, we observe similar results for seven or more sentences. This research is of particular interest for proprietary LLMs to facilitate accountability and prevent societal harm.
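
The idea of altering token probabilities under the control of a per-sentence key can be illustrated with a minimal sketch. It is not the authors' implementation: the integer key, the `key_to_letter` mapping, the toy vocabulary, and the bias strength `delta` are all hypothetical, and the semantic zero-shot classification that would produce the key is not modeled here. The sketch only shows how boosting the logits of tokens that begin with a key-selected letter raises their probability at the start of the next sentence.

```python
import math

def key_to_letter(key):
    """Hypothetical mapping from an integer key to a target initial letter."""
    return chr(ord("a") + key % 26)

def bias_logits_for_acrostic(logits, vocab, target_letter, delta=2.0):
    """Add `delta` to the logit of every token that starts with `target_letter`."""
    biased = []
    for token, logit in zip(vocab, logits):
        starts = token.strip().lower().startswith(target_letter.lower())
        biased.append(logit + delta if starts else logit)
    return biased

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: candidate tokens for the first word of the next sentence.
vocab = ["The", "Many", "Some", "Models"]
logits = [1.0, 0.5, 0.8, 0.2]

key = 12  # assumed to come from the previous sentence; 12 maps to "m"
target = key_to_letter(key)

before = softmax(logits)
after = softmax(bias_logits_for_acrostic(logits, vocab, target))
```

In this toy setting the probability mass shifts toward "Many" and "Models" while the distribution still sums to one. A soft bias of this kind leaves the model free to pick another token when the favored ones fit poorly.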

Paper Structure (30 sections, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of our method, where during the text generation a semantic key is derived from each sentence (orange highlight). The key controls the update of the probabilities of the generated tokens in the following sentence to reflect stylometric features, which represent our watermark (green highlight). The detection of the watermark will work for any sequence of sentences from a longer document (i.e., not limited to an entire response).
  • Figure 2: Plot of the 156 prompts with unaltered (orange) and watermarked (blue) responses, with the length of the response in sentences on the x-axis and a zoomed-in chart on the right to highlight the level of significance. Dots above the threshold line are considered statistically significant. There are no false positives for responses of more than 6 sentences.
  • Figure 3: Plot of the 156 prompts for normal and watermarked responses after a cyclic translation attack. Even under this type of attack, the watermark was recoverable for the majority of prompts. For responses longer than 7 sentences, the attack was successful in only 9 cases.
  • Figure 4: Comparison of the individual contributions of the two feature types. Results for the sensorimotor features are shown on the left and results for the acrostica on the right.
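
The threshold line in Figures 2 and 3 reflects a statistical significance test. As a rough illustration of how such a test could work for the acrostic feature, the sketch below runs a one-sided binomial test on the number of sentences whose first letter matches the letter expected from the key. The null match probability `p_null = 1/26` (uniform initial letters) and the significance level `alpha = 0.01` are simplifying assumptions for this example, not values taken from the paper, and the paper's actual test statistic may differ.

```python
from math import comb

def binomial_p_value(matches, trials, p_null):
    """One-sided p-value P(X >= matches) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(matches, trials + 1)
    )

def is_watermarked(matches, trials, p_null=1 / 26, alpha=0.01):
    """Reject the 'not watermarked' null hypothesis if the p-value is below alpha."""
    return binomial_p_value(matches, trials, p_null) < alpha

# Hypothetical counts: 6 of 7 sentence transitions match the expected letter.
flagged = is_watermarked(matches=6, trials=7)

# A single chance match in 7 transitions is not significant under these assumptions.
not_flagged = is_watermarked(matches=1, trials=7)
```

Under these assumptions, longer responses give the test more trials, which is consistent with the figures' observation that detection becomes more reliable as the number of sentences grows.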