Intent-conditioned and Non-toxic Counterspeech Generation using Multi-Task Instruction Tuning with RLAIF

Amey Hengle; Aswini Kumar; Sahajpreet Singh; Anil Bandhakavi; Md Shad Akhtar; Tanmoy Chakraborty

TL;DR

CoARL is a novel framework that enhances counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. It outperforms existing benchmarks in intent-conditioned counterspeech generation.

Abstract

Counterspeech, defined as a response that mitigates online hate speech, is increasingly used as a non-censorial solution. Addressing hate speech effectively involves dispelling the stereotypes, prejudices, and biases that are often only implied in brief, single-sentence statements or abuses. Such implicit expressions are challenging for language models, especially in seq2seq tasks, because models typically perform better with longer contexts. Our study introduces CoARL, a novel framework that enhances counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. CoARL's first two phases involve sequential multi-instruction tuning: the model first learns to understand the intents, reactions, and harms of offensive statements, and then learns task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and non-toxicity. CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, with an average improvement of 3 points in intent-conformity and 4 points in argument-quality metrics. Extensive human evaluation supports CoARL's efficacy in generating superior and more context-appropriate responses than existing systems, including prominent LLMs such as ChatGPT.
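
The "low-rank adapter weights" mentioned in the abstract can be illustrated with a minimal sketch of a generic low-rank (LoRA-style) update on a frozen weight matrix. This is not the authors' implementation: the function names, the `alpha` scaling, and the toy dimensions are assumptions made for illustration only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha=1.0):
    """Hypothetical helper: y = W x + (alpha / r) * B (A x).

    w_frozen: (d_out, d_in) base weights, never updated in this phase.
    lora_a:   (r, d_in) trainable down-projection.
    lora_b:   (d_out, r) trainable up-projection.
    """
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy example (assumed sizes): d_in = d_out = 2, rank r = 1.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
b_zero = [[0.0], [0.0]]      # common initialization: the adapter starts as a no-op
b_trained = [[1.0], [2.0]]   # stand-in for adapter weights after training

print(lora_forward([1.0, 2.0], w, a, b_zero))
print(lora_forward([1.0, 2.0], w, a, b_trained))
```

Because only `lora_a` and `lora_b` would be trained while `w_frozen` stays fixed, what was learned in the earlier phase is left untouched, which matches the motivation given in the Figure 2 caption.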

Paper Structure (40 sections, 5 equations, 3 figures, 6 tables)

Introduction
Motivation:
Our Contribution:
Related Work
Automatic Counterspeech Generation:
Instruction Tuning and RLAIF:
Dataset
Proposed Methodology
Supervised Fine-Tuning:
Reward Model (RM):
Reinforcement Learning:
Experimental Setup
Baselines
Evaluation Metrics
Experimental Results
...and 25 more sections

Figures (3)

Figure 1: Classical methods vs. instruction tuning for counterspeech generation. These examples show that counterspeech generation can be improved by the use of detailed and explicit instructions that allow a model to focus on the different aspects of a given hate speech.
Figure 2: Overview of the three-phased architecture of CoARL. In the first phase (left), CoARL is trained on an auxiliary task of hate speech (HS) explanation generation using a multi-task IT setup. Subsequently, in the second phase (center), task-specific LoRA weights are trained while the model parameters from the previous phase are frozen, thus enabling forward knowledge transfer without catastrophic forgetting. In the final phase (right), the model output is optimized via RL using feedback from a composite reward model consisting of three pre-trained classifiers.
Figure 3: Visual exploration of the distributions of various attributes in IntentCONANv2.
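
The Figure 2 caption describes a composite reward built from three pre-trained classifiers. The sketch below shows one plausible way to combine three classifier scores into a single scalar reward, using a weighted average. The classifier names, the [0, 1] score range, and the weighting scheme are all assumptions; the listing above does not say how the authors combine the scores.

```python
def composite_reward(scores, weights=None):
    """Hypothetical helper: weighted average of classifier scores.

    scores:  dict mapping a classifier name to its score in [0, 1].
    weights: optional dict of non-negative weights; defaults to equal weights.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total


# Assumed classifier names, chosen only to echo the goals stated in the abstract.
example_scores = {
    "intent_conformity": 1.0,
    "argument_quality": 0.5,
    "non_toxicity": 0.0,
}
print(composite_reward(example_scores))
```

In an RL loop, a scalar like this would be computed for each generated counterspeech and passed to the policy-optimization step as its reward.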
