Unmasking information manipulation: A quantitative approach to detecting Copy-pasta, Rewording, and Translation on Social Media

Manon Richard, Lisa Giordani, Cristian Brokate, Jean Liénard

TL;DR

The paper tackles the detection of information manipulation on social media by jointly identifying Copy-Pasta, Rewording, and Translation within a unified framework called the $3\Delta$-space. It computes three proximity measures ($\Delta_{semantic}$, $\Delta_{grapheme}$, and $\Delta_{language}$) to label message pairs and detect near-duplicate clusters that indicate coordination. The approach is demonstrated on synthetic data generated with ChatGPT and DeepL and on a real-world Twitter dataset on Venezuelan actors. The results show strong semantic discrimination with the Universal Sentence Encoder (USE) and competitive grapheme-based detection (Levenshtein, etc.), and they reveal network-level patterns such as distinct account typologies and narrative focuses, including political, entertainment, and alcohol-themed content. The approach offers a scalable, language-robust tool for identifying manipulated and translated content, with practical implications for moderation, tracking AI-generated campaigns, and studying large-scale disinformation operations.

Abstract

This study proposes a comprehensive methodology for identifying three techniques utilized in foreign-operated information manipulation campaigns: Copy-Pasta, Rewording, and Translation. Our approach, dubbed the "$3\Delta$-space duplicate methodology", quantifies the semantic, grapheme, and language aspects of messages. Computing pairwise distances within these dimensions enables detection of abnormally close messages that are likely part of a coordinated campaign. We validate our approach using a synthetic dataset generated with ChatGPT and DeepL, further applying it to a real-world dataset on Venezuelan actors from Twitter Transparency. Our method successfully identifies all three types of inauthentic duplicates in the synthetic dataset, and is able to uncover inauthentic duplicates across political, commercial, and entertainment contexts in the Twitter dataset. The distinct focus on clustered alterations to messages, rather than individual messages, makes our approach efficient and effective at detecting large-scale instances of textual manipulation, including AI-generated ones. Moreover, our method offers a robust tool for identifying translated content, overlooked in previous research. This research also represents the first comprehensive analysis of copy-pasta detection, providing a reliable technique for tracking duplicate textual content across social networks.
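
To make the pairwise labeling idea concrete, here is a minimal sketch of how a message pair might be assigned to one of the categories from its distances. It is an illustration under stated assumptions, not the authors' implementation: the function names (`levenshtein`, `grapheme_distance`, `label_pair`), the length normalization, the direction of the comparisons, and the decision order are all hypothetical. The semantic distance and the same-language flag are assumed to be computed elsewhere (for example with a sentence encoder and a language detector). The default thresholds reuse the values quoted in the Figure 4 caption purely as placeholders; they were tuned for specific distance measures and may not transfer to the normalized edit distance used here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def grapheme_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def label_pair(delta_semantic: float, delta_grapheme: float, same_language: bool,
               tau_s: float = 0.33, tau_g: float = 0.31) -> str:
    """Hypothetical decision rule over the three distances.

    tau_s and tau_g are placeholder thresholds (values taken from the
    Figure 4 caption); the comparison directions are assumptions.
    """
    if delta_semantic > tau_s:
        return "Control"        # meanings differ: not a duplicate
    if not same_language:
        return "Translation"    # same meaning, different language
    if delta_grapheme <= tau_g:
        return "Copy-Pasta"     # same meaning, nearly identical characters
    return "Rewording"          # same meaning, different surface form


if __name__ == "__main__":
    a = "Vote today, every voice counts!"
    b = "Vote today, every voice counts!!"
    print(label_pair(0.02, grapheme_distance(a, b), same_language=True))
```

Taking the distances as inputs keeps the rule independent of any particular encoder or edit-distance algorithm, so alternative measures can be swapped in without touching the labeling logic.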

Paper Structure (15 sections, 8 equations, 9 figures, 2 tables, 1 algorithm)

Figures (9)

  • Figure 1: Schema of synthetic dataset creation.
  • Figure 2: Density distribution of USE (a) and ADA-002 (b) distances for synthetic dataset message pairs. The optimal thresholds distinguishing between identical-meaning pairs (Copy-Pasta, Rewording and Translation) and the Control group are shown with dashed lines.
  • Figure 3: ROC curve benchmarking algorithm detection of Copy-Pasta vs. Rewording. Markers correspond to the optimal solution according to the $J$ statistic. The 95% confidence intervals computed from bootstrapping are shown with error bars. Overall, the Gzip, Levenshtein, and Ratcliff-Obershelp trio exhibits similar performance on the synthetic dataset and outperforms the two bigram-based approaches.
  • Figure 4: Confusion matrix demonstrating algorithm performance on the synthetic experiment with semantic distance optimal threshold ($\tau_s=0.33$) and grapheme distance optimal threshold ($\tau_p=0.31$).
  • Figure 5: Distribution of the $3\Delta$-space distances computed for message pairs of the synthetic dataset. Plot densities represent each transformation, including "Control" cases. The left plot shows the same-language condition, where the three possible categories are Copy-Pasta, Rewording and Control. The right plot shows the different-language condition and its two categories: Translation and Control. A few representative examples are annotated in the plots and shown below, with "plus" glyphs denoting correctly classified pairs and "cross" glyphs denoting misclassified pairs. The color code of the examples matches the category; thus $\pmb +$ is a true positive for Copy-Pasta, and $\pmb\times$ is a false positive for Copy-Pasta. As these examples illustrate, the relative overlap between Copy-Pasta and Rewording can be attributed to ambiguous pairs whose ground-truth labels could fall in either category.
  • ...and 4 more figures
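
The captions of Figures 2 to 4 refer to optimal thresholds chosen with the $J$ statistic. Assuming this is Youden's $J$ (true positive rate minus false positive rate), the selection can be sketched as below. The function name `youden_threshold`, the convention that smaller distances indicate positives, and the toy data are assumptions made for illustration; they do not come from the paper.

```python
def youden_threshold(distances, labels):
    """Return (threshold, J) maximizing J = TPR - FPR.

    A pair is predicted positive when its distance is <= threshold
    (assumed convention: small distance means duplicate).
    labels holds True for positive pairs and False for negative pairs.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative pair")

    best_threshold, best_j = None, float("-inf")
    # Every observed distance is a candidate cut-off.
    for t in sorted(set(distances)):
        tp = sum(1 for d, y in zip(distances, labels) if y and d <= t)
        fp = sum(1 for d, y in zip(distances, labels) if not y and d <= t)
        j = tp / positives - fp / negatives
        if j > best_j:  # strict comparison keeps the smallest best threshold
            best_threshold, best_j = t, j
    return best_threshold, best_j


if __name__ == "__main__":
    # Toy, made-up distances: three duplicate pairs and three control pairs.
    distances = [0.05, 0.10, 0.20, 0.45, 0.60, 0.80]
    labels = [True, True, True, False, False, False]
    print(youden_threshold(distances, labels))
```

This brute-force scan is quadratic in the number of pairs; for larger datasets a sorted cumulative count, or a library ROC routine, would be the more practical choice.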