Cross-Lingual Summarization as a Black-Box Watermark Removal Attack
Gokul Ganesan
TL;DR
This paper addresses the brittleness of distributional watermarking in multilingual NLP pipelines by introducing the Cross-Lingual Summarization Attack (CLSA), a translate→summarize→(optional back-translate) pipeline that imposes a semantic bottleneck to erase token-level watermark cues. Evaluated across four watermark detectors (KGW, SIR, XSIR, Unigram) and five languages using public MT and multilingual summarization models, CLSA consistently drives detector AUROC toward chance while preserving task utility, outperforming monolingual paraphrase and prior cross-lingual attacks such as CWRA. The key finding is that combining cross-lingual rewriting with content compression disrupts seed-token cues, n-gram locality, and semantic neighborhoods more effectively than either component alone, even for schemes designed for cross-lingual robustness such as XSIR. The results underscore a practical vulnerability in distributional watermarking for provenance and argue for defenses that integrate cryptographic verification or model attestation, as well as length-aware or multi-modal watermarking, to withstand the realities of multilingual processing.
Abstract
Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks either leave the watermark partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA), which translate text to a pivot language, summarize it, and optionally back-translate it, constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC is $0.827$ under paraphrasing and $0.823$ under Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] with Chinese as the pivot, whereas CLSA drives it down to $0.53$ (near chance). These results point to a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
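The translate, summarize, and optional back-translate steps described above can be sketched as a simple composition of text-to-text functions. This is a minimal illustration, not the paper's implementation: the function name `clsa`, the `TextFn` alias, and the three toy stand-ins are hypothetical, and in practice each stage would wrap an actual MT or multilingual summarization model.

```python
from typing import Callable, Optional

# A text-to-text stage. In a real attack each stage would wrap an MT or
# summarization model; here it is just a function (an assumption of this sketch).
TextFn = Callable[[str], str]


def clsa(
    text: str,
    to_pivot: TextFn,
    summarize: TextFn,
    from_pivot: Optional[TextFn] = None,
) -> str:
    """Translate to a pivot language, summarize there, optionally translate back."""
    pivot_text = to_pivot(text)      # cross-lingual rewrite
    summary = summarize(pivot_text)  # semantic bottleneck via compression
    if from_pivot is None:           # back-translation is optional
        return summary
    return from_pivot(summary)


# Toy stand-ins so the sketch runs without any models (purely illustrative).
def toy_to_pivot(text: str) -> str:
    # Pretend to translate by tagging the text as being in the pivot language.
    return "[pivot] " + text


def toy_summarize(text: str) -> str:
    # Keep only the first sentence as a crude "summary".
    return text.split(". ")[0].rstrip(".") + "."


def toy_from_pivot(text: str) -> str:
    # Pretend to back-translate by removing the pivot tag.
    return text.replace("[pivot] ", "", 1)


if __name__ == "__main__":
    watermarked = "The model wrote this sentence. It also wrote this one."
    print(clsa(watermarked, toy_to_pivot, toy_summarize, toy_from_pivot))
```

Passing the stages as callables keeps the sketch independent of any particular model or library, and omitting `from_pivot` yields the variant without back-translation.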
