Can Language Model Moderators Improve the Health of Online Discourse?

Hyundong Cho; Shuai Liu; Taiwei Shi; Darpan Jain; Basem Rizk; Yuyang Huang; Zixun Lu; Nuan Wen; Jonathan Gratch; Emilio Ferrara; Jonathan May

TL;DR

The paper tackles the challenge of scaling online moderation by formalizing what makes a conversational moderator effective and by designing a safe, realistic evaluation framework to test language-model moderators offline. It compares prosocial-dialogue baselines with prompted, instruction-tuned LMs guided by conflict-resolution and prosocial techniques, revealing that prompted LMs can give specific and fair feedback but still struggle to consistently increase user cooperation and respect. The study demonstrates that Socratic- and CBT-inspired prompting (GPT-Socratic) yields the strongest overall results among the explored approaches, though evaluator perspective significantly influences perceived effectiveness. By releasing its evaluation framework and annotated dataset, the work provides a foundation for scalable, safer moderation research that can help improve the health of online discourse while highlighting remaining challenges in aligning automated moderation with diverse user behaviors. Overall, the findings suggest promising avenues for automation-assisted moderation but emphasize careful consideration of human-in-the-loop design and evaluation biases when deploying such systems.

Abstract

Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness grounded on moderation literature and establish design criteria for conducting realistic yet safe evaluation. We then propose a comprehensive evaluation framework to assess models' moderation capabilities independently of human intervention. With our framework, we conduct the first known study of language models as conversational moderators, finding that appropriately prompted models that incorporate insights from social science can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation.

Paper Structure (41 sections, 13 figures, 5 tables)

Introduction
Evaluating Conversational Moderation
Definition of moderation effectiveness
Metrics for conversational moderation effectiveness
Experimental design criteria
Evaluation framework overview
Controversial conversation stubs
Conversation continuation
Survey questions
Automated Conversational Moderation
Prosocial dialogue models
Prompted LMs
Experiment Details
Evaluation infrastructure
Annotation collection
...and 26 more sections

Figures (13)

Figure 1: While banning users or deleting their comments may push them towards echo chambers (left), conversational moderation can guide users towards more constructive behavior (right). Recent developments in instruction-tuned language models with conversational capabilities present an opportunity to perform conversational moderation at scale and improve the health of online discourse.
Figure 2: An overview of our evaluation framework. (1) We extract conversations with controversial comments from Reddit and use these as the seed conversations. (2) Moderator bots continue the seed conversations with participants who act as the moderated user. (3) At the end of the conversation, the participants answer a survey about the moderator and their experience.
Figure 3: An overview of the self-talk method for designing prompts for LMs. We keep the Reddit user prompt constant while we refine the moderator prompt iteratively after examining the generated conversations.
Figure 4: Survey results for evaluations done in first-person point of view. Error bars are standard error and bold numbers indicate statistically significant differences (at $p<0.05$) with the best performing moderator on each metric, which is GPT-Socratic for all metrics. Numbers next to the label in the legend are the number of samples annotated for each bot.
Figure 5: Survey results for evaluations done in third-person point of view. The diagram is annotated with the same method as Figure 4. Most trends from the first-person point of view apply here, but while scores for specific and fair remain similar, there is a statistically significant drop ($p<0.05$) for all GPT-based models for cooperative and respectful.
...and 8 more figures
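
The captions for Figures 4 and 5 refer to standard-error bars and to differences that are significant at $p<0.05$. The sketch below illustrates how such quantities could be computed for survey ratings. It is only an illustration: the captions do not say which significance test the authors used, so a two-sided permutation test stands in as an assumption, and the ratings, function names, and 1-5 scale are hypothetical.

```python
import math
import random
import statistics


def standard_error(scores):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Assumption: this test is a stand-in; the source does not name its test.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign ratings to the two groups.
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):])
        )
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical 1-5 survey ratings for two moderators (not data from the paper).
ratings_a = [4, 5, 4, 4, 5, 3, 4, 5]
ratings_b = [2, 3, 2, 3, 3, 2, 4, 2]

se_a = standard_error(ratings_a)
p = permutation_p_value(ratings_a, ratings_b)
significant = p < 0.05
```

In this reading, `se_a` would set the half-length of an error bar, and `significant` would decide whether a score is shown in bold relative to the best-performing moderator.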
