Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

Gokul Ganesan

TL;DR

This paper addresses the brittleness of distributional watermarking in multilingual NLP pipelines by introducing the Cross-Lingual Summarization Attack (CLSA), a translate→summarize→(optional back-translate) pipeline that imposes a semantic bottleneck to erase token-level watermark cues. Evaluated across four watermark detectors (KGW, SIR, XSIR, Unigram) and five languages using public MT and multilingual summarization models, CLSA consistently pushes detector AUROC toward chance while preserving task utility, outperforming monolingual paraphrase and prior cross-lingual attacks such as CWRA. The key finding is that combining cross-lingual rewriting with content compression disrupts seed token cues, n-gram locality, and semantic neighborhoods more effectively than either component alone, even for schemes designed to be cross-lingually robust, such as XSIR. The results expose a practical vulnerability in distributional watermarking for provenance and argue for defenses that integrate cryptographic verification or model attestation, as well as length-aware or multi-modal watermarking, to withstand realistic multilingual processing.

Abstract

Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA), which translate text to a pivot language and then apply summarization and optional back-translation, constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC is $0.827$ under paraphrasing and $0.823$ under Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] with Chinese as the pivot, whereas CLSA drives it down to $0.53$ (near chance). These results point to a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
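The three stages named in the abstract can be sketched as a simple composition of text-to-text functions. This is a minimal illustration, not the authors' implementation: the function name `clsa_attack`, its parameter names, and the toy stand-ins below are all hypothetical, and a real attack would plug a machine-translation model and a multilingual summarizer into the same slots.

```python
from typing import Callable, Optional

# Each stage is modeled as a plain text -> text function (an assumption
# made for illustration; the paper uses public MT and summarization models).
TextFn = Callable[[str], str]


def clsa_attack(
    text: str,
    translate_to_pivot: TextFn,
    summarize: TextFn,
    back_translate: Optional[TextFn] = None,
) -> str:
    """Translate -> summarize -> (optional) back-translate."""
    pivot_text = translate_to_pivot(text)   # leave the source language
    summary = summarize(pivot_text)         # compress: the semantic bottleneck
    if back_translate is None:
        return summary                      # output stays in the pivot language
    return back_translate(summary)          # return to the source language


# Toy stand-ins so the sketch runs without any models.
def toy_translate(text: str) -> str:
    return text.upper()                     # pretend upper case is the pivot language


def toy_summarize(text: str) -> str:
    return text.split(". ")[0].rstrip(".") + "."   # keep only the first sentence


def toy_back_translate(text: str) -> str:
    return text.lower()


example = "Watermarks bias token choice. Detectors count the biased tokens."
attacked = clsa_attack(example, toy_translate, toy_summarize, toy_back_translate)
pivot_only = clsa_attack(example, toy_translate, toy_summarize)
```

Because the attacker only calls the stages as black boxes, nothing in this composition needs access to the watermark key or the generating model, which is what makes the attack black-box.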

Paper Structure

This paper contains 10 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Summary metrics across detectors and languages. Bars aggregate AUROC, AUPRC, Accuracy@thr, F1@thr, EER, and TPR@1% FPR for baselines vs. CLSA. CLSA consistently drives AUROC toward chance (lower effective separability), increases EER, and collapses TPR@1% FPR toward zero while keeping utility high.
  • Figure 2: AUROC by detector and language. CLSA consistently trends toward chance performance across all evaluated combinations.
  • Figure 3: TPR at 1% FPR: CLSA collapses true-positive rates at stringent false-positive operating points, indicating practical detection failure.
  • Figure 4: Equal Error Rate (EER): higher values under CLSA indicate reduced separability between watermarked and non-watermarked content.
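The separability metrics reported in the figures (AUROC, TPR at 1% FPR, and EER) can be computed from detector scores alone. The sketch below is a plain-Python illustration under stated assumptions: higher scores mean "more likely watermarked", the function names are hypothetical, and the example scores are made up for demonstration rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a watermarked sample outscores a clean one (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def _rates(pos: Sequence[float], neg: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) at every distinct threshold, flagging a sample when score >= threshold."""
    thresholds = sorted(set(pos) | set(neg)) + [float("inf")]
    rates = []
    for t in thresholds:
        fpr = sum(n >= t for n in neg) / len(neg)
        tpr = sum(p >= t for p in pos) / len(pos)
        rates.append((fpr, tpr))
    return rates


def tpr_at_fpr(pos: Sequence[float], neg: Sequence[float], max_fpr: float = 0.01) -> float:
    """Best achievable TPR among thresholds whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in _rates(pos, neg) if fpr <= max_fpr)


def eer(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Error rate at the threshold where FPR and FNR are closest to equal."""
    fpr, tpr = min(_rates(pos, neg), key=lambda r: abs(r[0] - (1.0 - r[1])))
    return (fpr + (1.0 - tpr)) / 2.0


# Made-up detector scores: watermarked (pos) vs. non-watermarked (neg).
pos_scores = [0.9, 0.8, 0.7, 0.4]
neg_scores = [0.6, 0.3, 0.2, 0.1]

print(auroc(pos_scores, neg_scores))       # 0.9375
print(tpr_at_fpr(pos_scores, neg_scores))  # 0.75
print(eer(pos_scores, neg_scores))         # 0.25
```

When the two score distributions coincide, `auroc` returns 0.5. That is the chance level the figures describe CLSA approaching, and it is accompanied by a higher EER and a lower TPR at a fixed low FPR.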