Ideal Attribution and Faithful Watermarks for Language Models
Min Jae Song, Kameron Shahabi
TL;DR
This work reframes attribution and watermarking for language-model outputs in terms of ideal attribution mechanisms grounded in a ledger of prompt–response transcripts. By formalizing ideal attribution mechanisms and transcript-level attribution maps, it provides a principled foundation for designing watermarking schemes that are undetectable and faithful to a chosen attribution target. The authors show how ideal pseudorandom codes (PRCs) and block-aligned strategies yield undetectable, faithful watermarking for the uniform distribution, and they extend the analysis to general language models, identifying when faithfulness can be maintained and when it cannot. A core contribution is the introduction of anytime soundness and time-coarsening to reconcile time-invariant verification with time-dependent attribution, along with constructions that couple digital signatures with PRCs to achieve unforgeability in the idealized setting. Collectively, the framework offers a cryptography-inspired roadmap for verifiable provenance mechanisms in practical watermarking systems and guides future work on unforgeable and robust primitives for real-world LLM ecosystems.
Abstract
We introduce ideal attribution mechanisms, a formal abstraction for reasoning about attribution decisions over strings. At the core of this abstraction lies the ledger, an append-only log of the prompt-response interaction history between a model and its user. Each mechanism produces deterministic decisions based on the ledger and an explicit selection criterion, making it well-suited to serve as a ground truth for attribution. We frame the design goal of watermarking schemes as faithful representation of ideal attribution mechanisms. This novel perspective brings conceptual clarity, replacing piecemeal probabilistic statements with a unified language for stating the guarantees of each scheme. It also enables precise reasoning about desiderata for future watermarking schemes, even when no current construction achieves them, since the ideal functionalities are specified first. In this way, the framework provides a roadmap that clarifies which guarantees are attainable in an idealized setting and worth pursuing in practice.
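To make the ledger abstraction concrete, the following is a minimal Python sketch of an append-only ledger and a deterministic attribution decision parameterized by a selection criterion. It is an illustration under stated assumptions, not the paper's formal definition: the names `Ledger`, `attribute`, `exact_response`, and `shares_substring` are hypothetical, and the two criteria are simple stand-ins chosen only to show how the same ledger can support different attribution targets.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A transcript is one (prompt, response) pair recorded in the ledger.
Transcript = Tuple[str, str]

# A selection criterion decides whether a query string should be
# attributed to a single transcript. (Hypothetical interface.)
Criterion = Callable[[str, Transcript], bool]


@dataclass
class Ledger:
    """Append-only log of prompt-response transcripts (illustrative)."""
    _entries: List[Transcript] = field(default_factory=list)

    def append(self, prompt: str, response: str) -> None:
        # The only mutation offered is appending; nothing is edited or removed.
        self._entries.append((prompt, response))

    def entries(self) -> Tuple[Transcript, ...]:
        # Return an immutable snapshot of the history so far.
        return tuple(self._entries)


def exact_response(query: str, transcript: Transcript) -> bool:
    """Illustrative criterion: the query equals a logged response."""
    return query == transcript[1]


def shares_substring(min_len: int) -> Criterion:
    """Illustrative criterion: the query shares a substring of at least
    min_len characters with a logged response (assumes min_len >= 1)."""
    def criterion(query: str, transcript: Transcript) -> bool:
        response = transcript[1]
        return any(
            query[i:i + min_len] in response
            for i in range(len(query) - min_len + 1)
        )
    return criterion


def attribute(ledger: Ledger, query: str, criterion: Criterion) -> bool:
    """Deterministic decision: attribute the query if and only if some
    transcript in the ledger satisfies the selection criterion."""
    return any(criterion(query, t) for t in ledger.entries())


if __name__ == "__main__":
    ledger = Ledger()
    ledger.append("Tell me about foxes", "the quick brown fox")
    print(attribute(ledger, "the quick brown fox", exact_response))
    print(attribute(ledger, "a quick brown dog", exact_response))
    print(attribute(ledger, "a quick brown dog", shares_substring(11)))
```

Because the decision depends only on the ledger contents and the chosen criterion, repeating a query against the same ledger always gives the same answer, which is the property that lets such a mechanism serve as a ground truth. A watermarking scheme would then be judged by how closely its detector reproduces these decisions without access to the ledger.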
