Unmasking the Imposters: How Censorship and Domain Adaptation Affect the Detection of Machine-Generated Tweets

Bryan E. Tuck, Rakesh M. Verma

TL;DR

This work examines how censorship and domain adaptation shape the detectability of machine-generated tweets. The authors fine-tune nine model variants (censored and uncensored) from four LLM families on in-domain Twitter data, generate emotion-preserving synthetic tweets, and assemble nine paired human/machine datasets for detector evaluation with BERTweet, DeBERTaV3, and a soft ensemble, with and without stylometric features. Uncensored models achieve greater lexical richness and more human-like structure, but at the cost of higher toxicity and substantially weakened detection performance; censored models remain more detectable but less linguistically diverse. The findings underscore the need for domain-aware moderation and for detectors that can adapt to rapidly evolving generative capabilities, with careful attention to ethical implications and platform-specific dynamics.

Abstract

The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their potential misuse on social media platforms. We present a comprehensive methodology for creating nine Twitter datasets to examine the generative capabilities of four prominent LLMs: Llama 3, Mistral, Qwen2, and GPT-4o. These datasets encompass four censored and five uncensored model configurations, including 7B and 8B parameter base-instruction models of the three open-source LLMs. Additionally, we perform a data quality analysis to assess the characteristics of textual outputs from human, "censored," and "uncensored" models, employing semantic meaning, lexical richness, structural patterns, content characteristics, and detector performance metrics to identify differences and similarities. Our evaluation demonstrates that "uncensored" models significantly undermine the effectiveness of automated detection methods. This study addresses a critical gap by exploring smaller open-source models and the ramifications of "uncensoring," providing valuable insights into how domain adaptation and content moderation strategies influence both the detectability and structural characteristics of machine-generated text.
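
As a concrete illustration of the detection setup, below is a minimal sketch of the soft-ensemble idea named in the TL;DR: average the class probabilities of two transformer detectors, then take the argmax. The checkpoints shown are the public BERTweet and DeBERTaV3 base models, stand-ins for the paper's fine-tuned detectors, and the human/machine label ordering is an assumption.

```python
# Soft-ensemble sketch: average class probabilities across detectors,
# then argmax. The base checkpoints below have untrained classification
# heads; the paper fine-tunes its detectors on human/machine tweet pairs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = ["vinai/bertweet-base", "microsoft/deberta-v3-base"]  # assumed bases

def ensemble_predict(texts):
    all_probs = []
    for name in CHECKPOINTS:
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
        model.eval()
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**batch).logits
        all_probs.append(torch.softmax(logits, dim=-1))
    mean_probs = torch.stack(all_probs).mean(dim=0)  # soft voting: mean of probabilities
    return mean_probs.argmax(dim=-1)  # assumed label order: 0 = human, 1 = machine

print(ensemble_predict(["just watched the sunrise, feeling grateful"]))
```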

Paper Structure

This paper contains 46 sections, 9 equations, 10 figures, and 12 tables.

Figures (10)

  • Figure 1: Overview of the experimental workflow. We begin with the TweetEval benchmark, selecting tweets from multiple tasks (emotion, hate speech, offensive language, sentiment, and irony) to fine-tune our LLMs. The emotion dataset tweets serve as prompts for the fine-tuned large language models (LLMs), which generate synthetic, domain-specific text. The resulting human and machine-generated tweets are combined, assessed for data quality across semantic, lexical, structural, and content dimensions, and finally evaluated by our six detection methods (including BERT-based models, stylometric features, and ensembles) to determine whether each sample is human or machine-produced.
  • Figure 2: We plot each model and human text on n-gram diversity (x-axis) and entropy (y-axis). GPT-4o appears in the upper right, exceeding human complexity. Humans and some censored models (e.g., Mistral) cluster near the center, indicating balanced variety. Certain censored models (e.g., Qwen2) remain lower in diversity and entropy, while uncensored models trend upward, increasing complexity compared to censored variants but not always surpassing human levels. (A minimal sketch of these two metrics follows this figure list.)
  • Figure 3: Comparison of fine-grained toxicity metrics. Each cell indicates the percentage of text samples exhibiting a given toxicity type. Human content is notably more toxic than most censored outputs, while some uncensored models approach human-level toxicity distributions.
  • Figure 4: Mean Precision, Recall, and F1 performance of our transformer-based detectors (BERTweet and DeBERTa variants, with and without stylometric features, and ensemble methods) tested across different censored and uncensored model outputs. Each heatmap cell represents the average metric score over five independent runs.
  • Figure 5: Average BERTScore across different censored and uncensored model variants. Higher BERTScore values indicate that model outputs are semantically more aligned with the human reference texts. The results show minimal differences in overall semantic fidelity among models, with both censored and uncensored versions achieving scores close to one another. This suggests that while removing moderation constraints may slightly shift linguistic patterns, it does not substantially degrade semantic alignment with the original tweets. (A BERTScore sketch follows this figure list.)
  • ...and 5 more figures
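
Figure 2's axes are straightforward to compute. Below is a minimal sketch under common definitions, which the paper may refine: n-gram diversity as the ratio of unique to total n-grams, and Shannon entropy over the unigram distribution.

```python
# Sketch of the two structural metrics plotted in Figure 2,
# under common (assumed) definitions.
import math
from collections import Counter

def ngram_diversity(tokens, n=2):
    """Unique n-grams divided by total n-grams (higher = more varied)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

def unigram_entropy(tokens):
    """Shannon entropy (bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

tokens = "the quick brown fox jumps over the lazy dog".split()
print(ngram_diversity(tokens, n=2), unigram_entropy(tokens))
```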
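
The semantic-fidelity comparison in Figure 5 can be reproduced with the reference `bert-score` package. The sketch below uses placeholder candidate/reference strings, and the language and baseline-rescaling settings are assumptions rather than the paper's reported configuration.

```python
# BERTScore sketch (Figure 5): higher F1 means the candidate is
# semantically closer to the human reference. Settings are assumptions.
from bert_score import score

candidates = ["synthetic tweet generated by a fine-tuned model"]  # placeholder
references = ["original human tweet used to prompt the model"]    # placeholder

P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```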