DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation

Hyeseon Ahn; Shinwoo Park; Suyeon Woo; Yo-Sub Han

TL;DR

The paper addresses the vulnerability of LLM watermarking to spoofing attacks by introducing DITTO, a knowledge-distillation-based framework that learns and replays a target model's watermark under black-box constraints. It demonstrates that watermark radioactivity can be exploited to impersonate a victim's watermark across multiple schemes, including green-list and SynthID, without a clear trade-off between attack strength and text quality. The findings reveal a critical security gap in provenance verification, showing that detectors can be misled into attributing outputs to the wrong model, and argue for authenticity-based defenses and cryptographic approaches to binding watermarks to model identity. The work underscores the need to move beyond presence detection toward robust, adversarially resilient provenance technologies for high-stakes AI deployments.

Abstract

The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.git.
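To make the gap concrete, here is a minimal sketch of a green-list style detector. It is not the DITTO implementation and does not perform distillation. The hash-based green-list rule, the key string, the value of `GAMMA`, and the helper `imitate_green_bias` are all assumptions chosen for illustration. The imitator reads the key directly, which a real black-box attacker cannot do; it only stands in for a student model that has absorbed the teacher's token bias through distillation. The point is that a detector of this kind tests whether green tokens are over-represented, not which model produced them.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the green list


def is_green(prev_token: int, token: int, key: str = "victim-key") -> bool:
    """Toy rule: hash (key, previous token, token) and call the token green
    when the first digest byte falls in the lower GAMMA share of 0..255."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] / 256.0 < GAMMA


def z_score(tokens: list[int], key: str = "victim-key") -> float:
    """One-proportion z-test on the number of green tokens among scored positions."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed the list
    if scored <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, c, key) for p, c in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


def imitate_green_bias(length: int, vocab_size: int = 1000,
                       key: str = "victim-key") -> list[int]:
    """Hypothetical stand-in for a spoofing model: always emit a green token.
    A distilled student would only approximate this bias, and without the key."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(vocab_size) if is_green(prev, t, key)))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    plain = [rng.randrange(1000) for _ in range(201)]
    spoofed = imitate_green_bias(201)
    print("unwatermarked z:", round(z_score(plain), 2))
    print("imitated z:", round(z_score(spoofed), 2))
```

With 200 scored positions that are all green, the imitated sequence reaches z = 100/sqrt(50), roughly 14.14, far above any usual detection threshold, even though the victim model never produced it. The statistic alone gives the detector no way to separate the two cases.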
