Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking
Liyan Xie, Muhammad Siddeek, Mohamed Seif, Andrea J. Goldsmith, Mengdi Wang
TL;DR
The paper addresses the problem of detecting post-generation edits to watermarked LLM outputs. It introduces a combinatorial pattern-based watermarking scheme that partitions the vocabulary into disjoint subsets and deterministically enforces a cyclic pattern over them during generation. It provides both a global statistic for watermark detection and lightweight local statistics for localizing edits, formalizes the problem with two task-specific metrics (Type-I error rate and detection accuracy), and evaluates on open-source LLMs across a variety of editing scenarios. Key contributions include the formal problem definition, a concrete pattern-based watermarking framework (with AB and ACADBD patterns), analytical bounds for edit-detection false alarms, and extensive empirical validation showing strong local edit localization and competitive watermark detectability. The approach improves accountability and traceability in AI-generated content by enabling token-level localization of edits while maintaining reliable watermark verification, with practical implications for collaboration, publishing, and security contexts.
Abstract
Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting local post-generation edits to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
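The mechanism described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. The round-robin partition (`token_id % k`), the uniform sampler standing in for an LLM, the match-fraction global statistic, and the per-position local check are all simplifying assumptions chosen for illustration. Only the AB pattern name and the idea of enforcing a cyclic pattern over disjoint vocabulary subsets come from the text above.

```python
import random

# Cyclic pattern over subset labels. "AB" is named in the summary above;
# everything else in this sketch is an illustrative assumption.
PATTERN = "AB"
SUBSETS = sorted(set(PATTERN))  # ['A', 'B']


def subset_of(token_id):
    """Toy partition (assumption): assign token ids to subsets round-robin."""
    return SUBSETS[token_id % len(SUBSETS)]


def expected_subset(position):
    """Subset label that the cyclic pattern requires at this position."""
    return PATTERN[position % len(PATTERN)]


def generate(length, vocab_size, rng):
    """Stand-in for an LLM: sample uniformly from the subset the pattern allows."""
    tokens = []
    for t in range(length):
        allowed = [v for v in range(vocab_size)
                   if subset_of(v) == expected_subset(t)]
        tokens.append(rng.choice(allowed))
    return tokens


def global_statistic(tokens):
    """Hypothetical global statistic: fraction of positions matching the pattern."""
    hits = sum(subset_of(tok) == expected_subset(t)
               for t, tok in enumerate(tokens))
    return hits / len(tokens)


def flag_edits(tokens):
    """Hypothetical local check: positions whose token violates the pattern."""
    return [t for t, tok in enumerate(tokens)
            if subset_of(tok) != expected_subset(t)]


rng = random.Random(0)
text = generate(20, 100, rng)

# Simulate a substitution edit at position 7. Position 7 expects subset 'B'
# (odd ids under the toy partition), so subtracting 1 moves the token to 'A'.
edited = list(text)
edited[7] = text[7] - 1

print(global_statistic(text), flag_edits(text))      # 1.0 []
print(global_statistic(edited), flag_edits(edited))  # 0.95 [7]
```

In this sketch the unedited sequence matches the pattern at every position, while the single substitution lowers the match fraction to 0.95 and is flagged at position 7. The position-based check only handles substitutions that change a token's subset. Insertions and deletions shift the pattern phase for all later tokens and would need a more careful local statistic than the one assumed here.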
