Outcome-Constrained Large Language Models for Countering Hate Speech

Lingzi Hong; Pengcheng Luo; Eduardo Blanco; Xiaoying Song

TL;DR

This study experiments with large language models (LLMs) to incorporate two desired conversation outcomes into the text generation process: low conversation incivility and non-hateful hater reentry.

Abstract

Automatic counterspeech generation methods have been developed to assist efforts in combating hate speech. Existing research focuses on generating counterspeech with linguistic attributes such as being polite, informative, and intent-driven. However, the real impact of counterspeech in online environments is seldom considered. This study aims to develop methods for generating counterspeech constrained by conversation outcomes and evaluate their effectiveness. We experiment with large language models (LLMs) to incorporate into the text generation process two desired conversation outcomes: low conversation incivility and non-hateful hater reentry. Specifically, we experiment with instruction prompts, LLM finetuning, and LLM reinforcement learning (RL). Evaluation results show that our methods effectively steer the generation of counterspeech toward the desired outcomes. Our analyses, however, show that there are differences in the quality and style depending on the model.

Paper Structure (38 sections, 1 figure, 10 tables)

Introduction
Related Work
Generating Counterspeech
Language Generation with Constraints
Methodology
Conversation Outcomes
Conversation Incivility
Hater Reentry Behavior
Outcome-Constrained Counterspeech Generation
Instruction Prompts
LLM Finetuning
Reinforcement Learning with LLM (RL)
Evaluation
Desired Conversation Outcome Metrics
Human Assessments
...and 23 more sections

Figures (1)

Figure 1: Two conversation outcomes (hater reentry and incivility) are assessed based on the conversation (green box) that follows a counterspeech reply (blue box). Comments in the first layer of the conversation tree (i.e., direct replies) are used to model hater reentry. All comments in the conversation tree are used to model conversation incivility. Grey boxes indicate hateful comments; others are non-hateful.
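
The split described in the Figure 1 caption can be illustrated with a small sketch: direct replies feed the hater reentry outcome, while every comment in the tree feeds the incivility outcome. This is not the paper's implementation. The `Comment` class, the function names, the three reentry labels, and the use of the share of hateful comments as a stand-in for incivility are all assumptions made for illustration; this section does not specify how either outcome is actually scored.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Comment:
    """Hypothetical node in a conversation tree."""
    author: str
    hateful: bool  # assumed to come from an upstream hateful/non-hateful label
    replies: List["Comment"] = field(default_factory=list)


def direct_replies(counterspeech: Comment) -> List[Comment]:
    """First layer of the tree: the comments used to model hater reentry."""
    return list(counterspeech.replies)


def all_comments(counterspeech: Comment) -> Iterator[Comment]:
    """Every comment below the counterspeech reply, at any depth."""
    stack = list(counterspeech.replies)
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.replies)


def hater_reentry(counterspeech: Comment, hater: str) -> str:
    """Illustrative label only: 'none', 'hateful', or 'non-hateful'."""
    own = [c for c in direct_replies(counterspeech) if c.author == hater]
    if not own:
        return "none"
    return "hateful" if any(c.hateful for c in own) else "non-hateful"


def hateful_share(counterspeech: Comment) -> float:
    """Assumed proxy for incivility: fraction of hateful comments in the tree."""
    comments = list(all_comments(counterspeech))
    if not comments:
        return 0.0
    return sum(c.hateful for c in comments) / len(comments)


# Toy conversation: the hater returns without hate, a bystander replies,
# and one nested comment deeper in the tree is hateful.
example = Comment(
    author="counter_speaker",
    hateful=False,
    replies=[
        Comment("hater", False, [Comment("other_user", True)]),
        Comment("bystander", False),
    ],
)

print(hater_reentry(example, "hater"))
print(hateful_share(example))
```

In this toy tree the nested hateful comment affects only the tree-wide measure, not the reentry label, which mirrors the caption's distinction between the first layer and the full conversation.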
