Signature vs. Substance: Evaluating the Balance of Adversarial Resistance and Linguistic Quality in Watermarking Large Language Models
William Guo, Adaku Uchendu, Ana Smith
TL;DR
This work analyzes the trade-off between adversarial resistance and linguistic quality in watermarking LLM outputs. It evaluates four watermarking techniques (KGW, SIR, Unbiased, EWD) under paraphrasing and back translation attacks using a MarkLLM pipeline on an English C4 subset, employing AUC-based detection and a suite of linguistic quality metrics. By examining POS distributions, sentiment, Levenshtein distance, and descriptive statistics, the study finds that Unbiased generally offers the best balance between detectability and preservation of writing style, while back translation erases watermark signals more effectively than paraphrasing. The findings suggest that longer, more noun-rich, syntactically dense text tends to resist watermark erosion, whereas positive sentiment correlates with weaker robustness, providing concrete guidance for designing robust watermarking schemes and evaluating their linguistic impact.
Abstract
To mitigate the potential harms of Large Language Model (LLM)-generated text, researchers have proposed watermarking, a process of embedding detectable signals within text. In principle, watermarking allows LLM-generated texts to be detected accurately. However, recent findings suggest that these techniques often degrade the quality of the generated texts, and that adversarial attacks can strip the watermarking signals, allowing the texts to evade detection. These findings have created resistance to the wide adoption of watermarking by LLM creators. To encourage adoption, we evaluate (1) the robustness of several watermarking techniques to adversarial attacks, comparing paraphrasing and back translation (i.e., English $\to$ another language $\to$ English) attacks; and (2) their ability to preserve the quality and writing style of the unwatermarked texts, as measured by linguistic metrics. Our results suggest that these watermarking techniques preserve semantics, deviate from the writing style of the unwatermarked texts, and are susceptible to adversarial attacks, especially the back translation attack.
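As a rough illustration of two of the measurements named above, the following Python sketch computes an AUC from detector scores and a Levenshtein distance between two strings. It is a minimal sketch, not the paper's pipeline: the function names `auc` and `levenshtein` and all score values are hypothetical, and MarkLLM is not used here.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Pairwise AUC: probability that a watermarked text receives a higher
    detector score than an unwatermarked one (ties count as half a win)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical detector scores, invented for illustration only.
unwatermarked = [0.3, 1.0, 0.8]
watermarked = [4.1, 3.7, 5.0]   # before any attack
attacked = [1.2, 0.4, 2.9]      # same texts after a simulated attack

auc_before = auc(watermarked, unwatermarked)
auc_after = auc(attacked, unwatermarked)
edit_distance = levenshtein("kitten", "sitting")
```

Under this framing, a drop from `auc_before` to `auc_after` would indicate that an attack weakened the watermark signal, while the edit distance gives a coarse measure of how much the attack changed the surface text.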
