DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation

Hyeseon Ahn, Shinwoo Park, Suyeon Woo, Yo-Sub Han

TL;DR

The paper addresses the vulnerability of LLM watermarking to spoofing attacks by introducing DITTO, a knowledge-distillation-based framework that learns and replays a target model's watermark under black-box constraints. It demonstrates that watermark radioactivity can be exploited to impersonate a victim's watermark across multiple schemes, including green-list and SynthID, without a clear trade-off between attack strength and text quality. The findings reveal a critical security gap in provenance verification, showing that detectors can be misled into attributing outputs to the wrong model, and argue for authenticity-based defenses and cryptographic approaches to binding watermarks to model identity. The work underscores the need to move beyond presence detection toward robust, adversarially resilient provenance technologies for high-stakes AI deployments.

Abstract

The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.git.

Paper Structure

This paper contains 27 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We 'do something bad' to reveal the fragility of current LLM watermarking schemes. An attacker’s model is trained to imitate the watermark of a trusted victim model, enabling it to generate content (e.g., fake news) that is then falsely attributed to the victim by a detector.
  • Figure 2: An overview of the DITTO framework. Our method consists of three main stages: (1) Watermark Inheritance, where a student model learns the teacher's watermarked patterns via knowledge distillation; (2) Watermark Extraction, where the watermark is isolated by analyzing averaged logit differences, both globally and for specific text prefixes; and (3) the Spoofing Attack, where the extracted signal is added to the attacker's logits to imitate the victim's watermark.
  • Figure 3: The effect of varying the $\alpha$ parameter on the p-value and Perplexity. The complete experimental results for this analysis are available in Table \ref{tab:alpha_full} in Appendix \ref{appendix:full}.
  • Figure 4: Impact of the scaling parameter $\alpha$ on spoofing performance against SynthID on the Dolly CW dataset.
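
The extraction and spoofing stages described in the Figure 2 caption can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the toy logits, and the choice to sum the global and prefix-specific signals before scaling by $\alpha$ are all assumptions made for illustration, since the captions only say that averaged logit differences are computed "both globally and for specific text prefixes" and then added to the attacker's logits.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Prefix = Tuple[int, ...]  # hypothetical: a prefix is a tuple of token ids


def mean_difference(student: Sequence[Vector], base: Sequence[Vector]) -> Vector:
    """Average (student - base) logits over a set of sampled positions."""
    if not student or len(student) != len(base):
        raise ValueError("need equally many, non-empty student and base rows")
    vocab = len(student[0])
    return [
        sum(s[v] - b[v] for s, b in zip(student, base)) / len(student)
        for v in range(vocab)
    ]


def extract_signals(
    prefixes: Sequence[Prefix],
    student: Sequence[Vector],
    base: Sequence[Vector],
) -> Tuple[Vector, Dict[Prefix, Vector]]:
    """Return the global signal and one averaged signal per distinct prefix."""
    global_signal = mean_difference(student, base)
    groups = defaultdict(lambda: ([], []))
    for p, s, b in zip(prefixes, student, base):
        groups[p][0].append(s)
        groups[p][1].append(b)
    prefix_signals = {
        p: mean_difference(s_rows, b_rows) for p, (s_rows, b_rows) in groups.items()
    }
    return global_signal, prefix_signals


def spoof_logits(
    attacker: Vector,
    global_signal: Vector,
    prefix_signal: Optional[Vector] = None,
    alpha: float = 1.0,
) -> Vector:
    """Add the alpha-scaled extracted signal to the attacker's logits.

    Assumption: global and prefix signals are simply summed; the source
    does not specify how the two are combined.
    """
    extra = prefix_signal if prefix_signal is not None else [0.0] * len(attacker)
    return [a + alpha * (g + e) for a, g, e in zip(attacker, global_signal, extra)]


# Toy data over a 4-token vocabulary (values are made up).
student_rows = [[1.0, 2.0, 0.0, 0.5], [0.0, 3.0, 1.0, 0.5]]  # distilled student
base_rows = [[1.0, 1.0, 0.0, 0.5], [0.0, 1.0, 1.0, 0.5]]     # same model before distillation
seen_prefixes = [(7,), (9,)]

global_sig, prefix_sigs = extract_signals(seen_prefixes, student_rows, base_rows)
spoofed = spoof_logits([0.2, 0.1, 0.3, 0.0], global_sig, alpha=2.0)
print(global_sig, spoofed)
```

In this toy setup only token 1 is boosted by distillation, so the spoofed logits shift toward that token. Raising `alpha` strengthens the shift, which is the knob whose effect on p-value and perplexity Figures 3 and 4 report.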