Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing

Neemesh Yadav, Jiarui Liu, Francesco Ortu, Roya Ensafi, Zhijing Jin, Rada Mihalcea

TL;DR

This work tackles the opacity of cross-country content moderation by applying NLP and explainability methods to analyze censorship decisions across five countries (Germany, France, India, Turkey, Russia) from 2011–2020 using Twitter Stream Grab data. It defines two tasks: reverse-engineering moderation decisions with encoder LMs and zero-shot LLMs, and explaining those decisions via SHAP values and LLM-guided reasoning, validated through human evaluation. The study finds that LLM classifiers can reproduce real-world moderation with noticeable cross-country variation, and that SHAP and LLM explanations reveal event-aligned patterns while highlighting limitations in faithful, region-specific reasoning. These insights advance auditing and responsible deployment of AI-driven censorship analysis, while underscoring the need for careful handling of bias, data scope, and ethical considerations in cross-cultural moderation research.

Abstract

The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at https://github.com/causalNLP/censorship
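
To make the Shapley-value attribution mentioned above concrete, the following is a minimal sketch that computes exact Shapley values for the tokens of one short post by enumerating every subset. It is not the SHAP library and not the paper's models: `toy_score`, its weights, and the example tokens are hypothetical stand-ins for a classifier's censorship score, and exact enumeration is only feasible for a handful of tokens.

```python
from itertools import combinations
from math import factorial


def toy_score(tokens):
    """Hypothetical stand-in for a classifier's censorship score."""
    weights = {"protest": 0.5, "government": 0.3}  # assumed toy weights
    score = sum(weights.get(t, 0.0) for t in tokens)
    # Assumed interaction: the two words together raise the score further.
    if "protest" in tokens and "government" in tokens:
        score += 0.2
    return score


def shapley_values(tokens, value_fn):
    """Exact Shapley value of each token position under value_fn.

    Runs in O(2^n) calls to value_fn, so it only suits very short inputs.
    Returns a list of (token, value) pairs in the original token order.
    """
    n = len(tokens)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):  # coalition sizes 0 .. n-1
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [tokens[j] for j in subset]
                with_i = [tokens[j] for j in sorted(subset + (i,))]
                # Marginal contribution of token i to this coalition.
                values[i] += weight * (value_fn(with_i) - value_fn(without_i))
    return list(zip(tokens, values))


if __name__ == "__main__":
    for token, phi in shapley_values(["the", "protest", "government"], toy_score):
        print(f"{token}: {phi:.3f}")
```

In this toy setup the attributions sum to the difference between the score of the full post and the score of the empty input, which is the efficiency property that makes Shapley values a convenient way to split a prediction among tokens.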

Paper Structure

This paper contains 56 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Category Distribution and Misclassification Patterns Across Countries. Panel (a) shows how samples are distributed across the censorship categories, indicating the relative prevalence of each category in the dataset. Panel (b) shows the categories predicted by our best-performing model across countries, that is, how misclassified samples are distributed among the predefined categories. Compared with the ground truth in panel (a), a substantial portion of the misclassified instances fall into the Other category, suggesting that these cases are harder to categorize accurately, possibly because their features overlap with those of other categories.
  • Figure 2: Unique Token Distribution Over Time. The figure illustrates the number of unique tokens that played a crucial role in the model's censorship predictions across different years and countries. Notably, peaks in the distribution often coincide with political or societal events that may have caused increased censorship activity.
  • Figure 3: t-SNE visualization of all censored posts from the training set, per country.
  • Figure 4: Word clouds showing the 300 most important keywords for all 6 categories, excluding stopwords.
  • Figure 5: Shapley bar plots for each of the 5 countries.
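
The per-year counts behind a plot like Figure 2 could be tallied along the following lines. This is a sketch under stated assumptions, not the authors' pipeline: the record layout `(country, year, [(token, attribution), ...])`, the `threshold` used to decide which tokens count as important, the lowercasing step, and the placeholder data are all hypothetical.

```python
from collections import defaultdict


def unique_important_tokens(records, threshold=0.1):
    """Count distinct high-attribution tokens per (country, year).

    records: iterable of (country, year, attributions), where attributions
    is a list of (token, score) pairs for one post (assumed layout).
    A token counts as important when |score| >= threshold (assumed rule).
    """
    buckets = defaultdict(set)
    for country, year, attributions in records:
        bucket = buckets[(country, year)]  # created even if nothing passes
        for token, score in attributions:
            if abs(score) >= threshold:
                bucket.add(token.lower())  # treat case variants as one token
    return {key: len(tokens) for key, tokens in buckets.items()}


if __name__ == "__main__":
    # Placeholder records, not data from the paper.
    toy_records = [
        ("country_a", 2016, [("alpha", 0.4), ("beta", 0.02)]),
        ("country_a", 2016, [("Alpha", 0.3), ("gamma", -0.2)]),
        ("country_b", 2017, [("delta", 0.05)]),
    ]
    for key, count in sorted(unique_important_tokens(toy_records).items()):
        print(key, count)
```

Plotting these counts per country against the year would give a curve whose peaks can then be compared with known political or societal events.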