The Content Moderator's Dilemma: Removal of Toxic Content and Distortions to Online Discourse

Mahyar Habibi, Dirk Hovy, Carlo Schwarz

TL;DR

This paper proposes and validates a methodology for measuring the content-moderation-induced distortions in online discourse using text embeddings from computational linguistics, and shows that removing toxic Tweets alters the semantic composition of content.

Abstract

There is an ongoing debate about how to moderate toxic speech on social media and the impact of content moderation on online discourse. This paper proposes and validates a methodology for measuring the content-moderation-induced distortions in online discourse using text embeddings from computational linguistics. Applying the method to a representative sample of 5 million US political Tweets, we find that removing toxic Tweets alters the semantic composition of content. This finding is consistent across different embedding models, toxicity metrics, and samples. Importantly, we demonstrate that these effects are not solely driven by toxic language but by the removal of topics often expressed in toxic form. We propose an alternative approach to content moderation that uses generative Large Language Models to rephrase toxic Tweets, preserving their salvageable content rather than removing them entirely. We show that this rephrasing strategy reduces toxicity while minimizing distortions in online content.
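As a rough illustration of the kind of measurement the abstract describes, the sketch below compares the embedding distribution of a full sample with the distribution that remains after high-toxicity items are removed. It rests on several assumptions that are not taken from the paper: the embeddings and toxicity scores are synthetic, each sample is approximated by a multivariate Gaussian, and the distance is the closed-form Bhattacharyya distance between those Gaussians (the figure list mentions a Bhattacharyya distance, but the paper's exact estimator is not given here). The function name `bhattacharyya_gaussian` and the `ridge` parameter are hypothetical.

```python
import numpy as np


def bhattacharyya_gaussian(x, y, ridge=1e-6):
    """Bhattacharyya distance between Gaussian fits of two embedding samples.

    x, y: arrays of shape (n_samples, n_dims) with n_dims >= 2.
    ridge: small diagonal term (an assumption) to keep covariances invertible.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = x.shape[1]

    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False) + ridge * np.eye(d)
    cov_y = np.cov(y, rowvar=False) + ridge * np.eye(d)
    cov_avg = 0.5 * (cov_x + cov_y)

    # Term driven by the shift in means.
    diff = mu_x - mu_y
    mean_term = 0.125 * diff @ np.linalg.solve(cov_avg, diff)

    # Term driven by the change in covariance structure.
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet_x + logdet_y))

    return float(mean_term + cov_term)


# Synthetic stand-ins for text embeddings and toxicity scores (assumptions).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 8))
# Toxicity is made to correlate with the first embedding dimension, so that
# removing toxic items also removes a region of the semantic space.
toxicity = embeddings[:, 0] + 0.5 * rng.normal(size=2000)

# "Moderation": drop the 10% of items with the highest toxicity score.
keep_toxic = toxicity < np.quantile(toxicity, 0.9)
dist_toxic = bhattacharyya_gaussian(embeddings, embeddings[keep_toxic])

# Benchmark: drop the same number of items uniformly at random.
keep_random = np.ones(2000, dtype=bool)
keep_random[rng.choice(2000, size=int((~keep_toxic).sum()), replace=False)] = False
dist_random = bhattacharyya_gaussian(embeddings, embeddings[keep_random])

print(f"distance after toxicity-based removal: {dist_toxic:.5f}")
print(f"distance after random removal:         {dist_random:.5f}")
```

Comparing against random removal of the same size gives a baseline for how much of the distance is due to sampling noise alone, rather than to the selective removal of content.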

Paper Structure

This paper contains 20 sections, 6 equations, 26 figures, 4 tables.

Figures (26)

  • Figure 1: Content Distortions and Removal of Toxic Content
  • Figure 2: Benchmarking BCD
  • Figure 3: Content Moderation and Topic Shifts
  • Figure 4: Decomposition of Bhattacharyya Distance
  • Figure 5: Content Plurality and Rephrasing of Toxic Content
  • ...and 21 more figures