HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Huy Nghiem, Hal Daumé

TL;DR

Pretraining on HateCOT significantly enhances the performance of open-source Large Language Models on three benchmark datasets for offensive content detection in both zero-shot and few-shot settings, despite differences in domain and task.

Abstract

The widespread use of social media necessitates reliable and efficient detection of offensive content to mitigate harmful effects. Although sophisticated models perform well on individual datasets, they often fail to generalize due to varying definitions and labeling of "offensive content." In this paper, we introduce HateCOT, an English dataset with over 52,000 samples from diverse sources, featuring explanations generated by GPT-3.5-Turbo and curated by humans. We demonstrate that pretraining on HateCOT significantly enhances the performance of open-source Large Language Models on three benchmark datasets for offensive content detection in both zero-shot and few-shot settings, despite differences in domain and task. Additionally, HateCOT facilitates effective K-shot fine-tuning of LLMs with limited data and improves the quality of their explanations, as confirmed by our human evaluation.

Paper Structure (36 sections, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Template used to obtain explanations from GPT-3.5-Turbo guided by human-annotated rationales.
  • Figure 2: Performance results of LLMs on test sets in various settings.
  • Figure 3: Heatmap of the average rating of explanations by finetuned Model (x-axis) and Dataset (y-axis) on 3 criteria from 1 (least) to 5 (very). Overall indicates average scores aggregated over all datasets. Triplets of scores in bold italics are those with $p$-value < 0.05 under a one-way ANOVA test that compares the ratings of the 3 models on the dataset in that row. Italicized-only scores indicate marginal significance ($p$-value $\approx$ 0.07).
  • Figure 4: Examples drawn from our training corpora showing their native Post, Target and Rationale, along with the corresponding GPT-3.5-Turbo-enhanced explanations. Because they are fragmented annotations, verbatim Rationales are not serviceable explanations, but they can serve as guiding signals that leverage GPT's generative capabilities to construct legible passages with detailed justifications.
  • Figure 5: Template used to prompt LLM for classification inference.
  • ...and 3 more figures