Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations

Giacomo Fidone, Lucia Passaro, Riccardo Guidotti

TL;DR

This work tackles the challenge of evaluating online moderation by introducing cosmos, a Large Language Model–powered agent-based model that simulates OSN conversations and runs parallel factual and counterfactual scenarios to measure moderation effects. By grounding agents in socio-demographic and psychological prompts and coupling them with a memory-driven moderation interface, cosmos demonstrates realistic toxic behavior, contagion across threads, and the superior effectiveness of personalized ex ante moderation strategies. The study systematically explores ex ante and ex post interventions, including One-Size-Fits-All, PMI variants, and Ban-based throttling, revealing trade-offs between toxicity reduction and content loss, as well as the influence of psychological traits on moderation outcomes. The results suggest cosmos can complement field observations and automated moderation tools, offering a controllable, reproducible platform for hypothesis testing and policy analysis in OSN moderation, while acknowledging limitations in LLM reliability, realism validation, and scalability.

Abstract

Online Social Networks (OSNs) widely adopt content moderation to mitigate the spread of abusive and toxic discourse. Nonetheless, the real effectiveness of moderation interventions remains unclear due to the high cost of data collection and limited experimental control. The latest developments in Natural Language Processing pave the way for a new evaluation approach. Large Language Models (LLMs) can be successfully leveraged to enhance Agent-Based Modeling and simulate human-like social behavior with an unprecedented degree of believability. Yet, existing tools do not support simulation-based evaluation of moderation strategies. We fill this gap by designing an LLM-powered simulator of OSN conversations enabling a parallel, counterfactual simulation where toxic behavior is influenced by moderation interventions, keeping all else equal. We conduct extensive experiments, unveiling the psychological realism of OSN agents, the emergence of social contagion phenomena, and the superior effectiveness of personalized moderation strategies.

Paper Structure

This paper contains 44 sections, 5 equations, 23 figures, 3 tables, 1 algorithm.

Figures (23)

  • Figure 1: Example of a factual thread and its counterfactual version from cosmos experiments. In the counterfactual simulation, Agent 19 receives a moderation message at time 1 for having submitted a toxic post. The memory of this message influences Agent 19's behavior at subsequent timestamps. For example, at time 3 it is effective at mitigating the toxicity of Agent 19's reply. In turn, this change has cascading effects on lower nodes: although Agent 6 has no memory of past moderation messages, at time 4 it reduces its toxicity. Some profile features of the two agents are displayed on the left (for full profiles, see Appendix A).
  • Figure 2: Median toxicity of agents in the sub-population simulation, compared to their median toxicity in full-population simulations. Least toxic agent marked in red.
  • Figure 3: Ex ante PMI messages encoded with BERT.
  • Figure 4: Mass divergence $\Delta M$ over each OCEAN trait for different intensity values across moderation strategies. Statistically significant reductions ($\Delta M {<} 0$) or increases ($\Delta M {>} 0$) are marked with an asterisk for Mann-Whitney with $p\text{-value} {<} 0.05$.
  • Figure 5: Quantile divergence $\Delta q$ (y-axis) computed on $q \in [0.0, 1.0]$ (x-axis) and averaged across simulation runs, for each moderation strategy. The error band represents standard deviations.
  • ...and 18 more figures
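
The caption of Figure 5 describes a quantile divergence $\Delta q$ evaluated on $q \in [0.0, 1.0]$, averaged across simulation runs, with a standard-deviation error band. The exact definition is not given in this excerpt, so the sketch below assumes $\Delta q$ is the $q$-quantile of counterfactual toxicity minus the $q$-quantile of factual toxicity. The function names (`quantile`, `quantile_divergence`, `summarize_runs`) and the toxicity scores are hypothetical and serve only as illustration.

```python
from statistics import mean, stdev


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty sample, with q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def quantile_divergence(factual, counterfactual, qs):
    """ASSUMED definition: counterfactual quantile minus factual quantile.

    Negative values mean toxicity at that quantile is lower under moderation.
    """
    return [quantile(counterfactual, q) - quantile(factual, q) for q in qs]


def summarize_runs(runs, qs):
    """Per-quantile mean and sample standard deviation of delta_q across runs.

    `runs` is a list of (factual_scores, counterfactual_scores) pairs,
    one pair per simulation run.
    """
    per_run = [quantile_divergence(f, c, qs) for f, c in runs]
    columns = list(zip(*per_run))
    means = [mean(col) for col in columns]
    stds = [stdev(col) if len(col) > 1 else 0.0 for col in columns]
    return means, stds


# Made-up toxicity scores in [0, 1] for two paired simulation runs.
qs = [0.0, 0.5, 1.0]
runs = [
    ([0.1, 0.4, 0.9], [0.1, 0.3, 0.5]),
    ([0.2, 0.5, 0.8], [0.1, 0.3, 0.6]),
]
means, stds = summarize_runs(runs, qs)
for q, m, s in zip(qs, means, stds):
    print(f"q={q:.1f}  mean delta_q={m:+.3f}  std={s:.3f}")
```

Under this assumed definition, a curve that stays below zero and drops further at high $q$ would indicate that moderation mostly reduces the most toxic content.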