Collaborative Content Moderation in the Fediverse
Haris Bin Zia, Aravindh Raman, Ignacio Castro, Gareth Tyson
TL;DR
This work tackles content moderation in the decentralized Fediverse, where individual instances must enforce their own local policies with limited resources. It introduces FedMod, a peer-to-peer federated learning system in which similar instances exchange partially trained model parameters to improve moderation without sharing raw posts. FedMod performs robustly across three tasks (harmful content detection, bot content detection, and content warning assignment), achieving average per-instance macro-F1 scores of about 0.71, 0.73, and 0.58, respectively. The evaluation also shows how hashsim-based peer selection, scalable pre-sampling, and policy alignment affect performance. The approach points to practical pathways for privacy-preserving, collaborative moderation in distributed social networks and outlines future work on reputation, longitudinal dynamics, and deployment with administrators.
Abstract
The Fediverse, a group of interconnected servers providing a variety of interoperable services (e.g., micro-blogging in Mastodon), has gained rapid popularity. This sudden growth, partly driven by Elon Musk's acquisition of Twitter, has created challenges for administrators. This paper focuses on one particular challenge: content moderation, e.g., the need to remove spam or hate speech. While centralized platforms like Facebook and Twitter rely on automated tools for moderation, these tools depend on massive labeled datasets and specialized infrastructure, which makes them impractical for decentralized, low-resource settings like the Fediverse. In this work, we design and evaluate FedMod, a collaborative content moderation system based on federated learning. Our system enables servers to exchange parameters of partially trained local content moderation models with similar servers, creating a federated model shared among collaborating servers. FedMod demonstrates robust performance on three different content moderation tasks: harmful content detection, bot content detection, and content warning assignment, achieving average per-server macro-F1 scores of 0.71, 0.73, and 0.58, respectively.
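The exchange described in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not FedMod's implementation: the similarity measure (Jaccard overlap of per-server tag sets), the unweighted averaging rule, and the names `jaccard`, `select_peers`, and `average_params` are all hypothetical stand-ins, since the abstract does not specify them.

```python
from typing import Dict, List, Set

Params = List[float]  # flattened model parameters (assumed representation)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_peers(me: str, profiles: Dict[str, Set[str]], k: int) -> List[str]:
    """Return the k servers whose profile is most similar to `me`.

    Assumption: each server is summarized by a set of tags; the actual
    similarity signal used by FedMod may differ.
    """
    others = [s for s in profiles if s != me]
    others.sort(key=lambda s: jaccard(profiles[me], profiles[s]), reverse=True)
    return others[:k]


def average_params(local: Params, peer_params: List[Params]) -> Params:
    """Element-wise mean of the local parameters and those received from peers.

    Assumption: plain unweighted averaging; only parameters are combined,
    no raw posts are exchanged.
    """
    all_params = [local] + peer_params
    n = len(all_params)
    return [sum(vals) / n for vals in zip(*all_params)]


# Hypothetical servers and tag profiles, for illustration only.
profiles = {
    "a.example": {"cats", "art"},
    "b.example": {"cats", "art", "music"},
    "c.example": {"politics"},
}
peers = select_peers("a.example", profiles, k=1)
merged = average_params([1.0, 2.0], [[3.0, 4.0]])
```

In this sketch a server would train locally for a few steps, pick its most similar peers, average the received parameters with its own, and continue training from the merged values.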
