HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Huy Nghiem; Hal Daumé

TL;DR

Pretraining on HateCOT significantly improves the performance of open-source Large Language Models on three benchmark datasets for offensive content detection in both zero-shot and few-shot settings, despite differences in domain and task.

Abstract

The widespread use of social media necessitates reliable and efficient detection of offensive content to mitigate harmful effects. Although sophisticated models perform well on individual datasets, they often fail to generalize due to varying definitions and labeling of "offensive content." In this paper, we introduce HateCOT, an English dataset with over 52,000 samples from diverse sources, featuring explanations generated by GPT-3.5-Turbo and curated by humans. We demonstrate that pretraining on HateCOT significantly enhances the performance of open-source Large Language Models on three benchmark datasets for offensive content detection in both zero-shot and few-shot settings, despite differences in domain and task. Additionally, HateCOT facilitates effective K-shot fine-tuning of LLMs with limited data and improves the quality of their explanations, as confirmed by our human evaluation.

Paper Structure (36 sections, 8 figures, 7 tables)

Introduction
Related Works
Offensive Speech Detection
LLMs in Offensive Speech Classification
Building HateCOT
Data Selection
Datasets for Training.
Datasets for Evaluation.
Obtaining Annotation-Guided Explanation.
Optimization of Synthesized Corpus
Optimization Procedure
Description of Procedure.
Experiment Configurations.
Answer Extraction.
Insights and Augmentation
...and 21 more sections

Figures (8)

Figure 1: Template used to obtain explanations from GPT-3.5-Turbo guided by human-annotated rationales.
Figure 2: Performance results of LLMs on test sets in various settings.
Figure 3: Heatmap of the average rating of explanations by finetuned Model (x-axis) and Dataset (y-axis) on 3 criteria from 1 (least) to 5 (very). Overall indicates average scores aggregated over all datasets. Triplets of scores in bold italics are those with $p$-value < 0.05 under a one-way ANOVA test that compares the ratings of the 3 models on the dataset in that row. Italicized-only scores indicate marginal significance ($p$-value $\approx$ 0.07).
Figure 4: Examples drawn from our training corpora showing their native Post, Target and Rationale, along with the corresponding GPT-3.5-Turbo-enhanced explanations. Because they are fragmented annotations, verbatim Rationales are not serviceable explanations, but they can serve as guiding signals that leverage GPT's generative capabilities to construct legible passages with detailed justifications.
Figure 5: Template used to prompt LLM for classification inference.
...and 3 more figures
