Ideal Attribution and Faithful Watermarks for Language Models

Min Jae Song, Kameron Shahabi

TL;DR

This work reframes attribution and watermarking for language-model outputs as the problem of faithfully representing ideal attribution mechanisms, which are grounded in a ledger of prompt–response transcripts. By formalizing ideal attribution mechanisms and transcript-level attribution maps, it provides a principled foundation for designing watermarking schemes that are undetectable and faithful to a chosen attribution target. The authors show how ideal pseudorandom codes (PRCs) and block-aligned strategies yield undetectable, faithful watermarking for the uniform distribution, and they extend the analysis to general language models, identifying when faithfulness can be maintained and when it cannot. A core contribution is the introduction of anytime soundness and time-coarsening to reconcile time-invariant verification with time-dependent attribution, along with constructions that couple digital signatures with PRCs to achieve unforgeability in the idealized setting. The framework thus offers a cryptography-inspired roadmap for verifiable provenance mechanisms in practical watermarking systems and guides future work on unforgeable and robust primitives for real-world LLM ecosystems.

Abstract

We introduce ideal attribution mechanisms, a formal abstraction for reasoning about attribution decisions over strings. At the core of this abstraction lies the ledger, an append-only log of the prompt-response interaction history between a model and its user. Each mechanism produces deterministic decisions based on the ledger and an explicit selection criterion, making it well-suited to serve as a ground truth for attribution. We frame the design goal of watermarking schemes as faithful representation of ideal attribution mechanisms. This novel perspective brings conceptual clarity, replacing piecemeal probabilistic statements with a unified language for stating the guarantees of each scheme. It also enables precise reasoning about desiderata for future watermarking schemes, even when no current construction achieves them, since the ideal functionalities are specified first. In this way, the framework provides a roadmap that clarifies which guarantees are attainable in an idealized setting and worth pursuing in practice.

Paper Structure

This paper contains 48 sections, 10 theorems, 66 equations, 1 algorithm.

Key Result

Proposition 2.3

Every transcript-level attribution map $\mathsf{R}:\{0,1\}^*\times\{0,1\}^*\to 2^{\{0,1\}^*}$ satisfying Definition 2.1 (transcript-level attribution) can be induced by some selection rule $\mathsf{Z}:\{0,1\}^*\times\{0,1\}^*\times\{0,1\}^* \to \{0,1\}$. Conversely, any transcript-level attribution map sa

Theorems & Definitions (39)

  • Definition 1.1: Informal version of the faithful-watermarking definition (def:watermarking-faithful)
  • Definition 2.1: Transcript-level attribution
  • Definition 2.2: Ledger-level attribution
  • Proposition 2.3: Surjection from selection rules to attribution maps
  • proof
  • Example 2.4: Non-injectivity
  • Definition 2.5: Attribution soundness
  • Definition 2.6: Predicate
  • Definition 2.7: Hamming
  • Definition 2.8: Predicate-based expansion
  • ...and 29 more