Probing Critical Learning Dynamics of PLMs for Hate Speech Detection

Sarah Masud, Mohammad Aflah Khan, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty

TL;DR

The paper investigates how critical learning dynamics of pretrained language models influence hate speech detection, examining pretraining seeds, intermediate checkpoints, data recency, finetuning layer choices, and classifier head complexity across seven English datasets. It finds that early pretraining checkpoints often yield peak downstream performance, newer pretraining data provides limited gains, and higher layers near the classifier head are typically most informative for finetuning, with notable exceptions for multilingual models like mBERT. The study challenges the assumption that domain-specific PLMs consistently outperform general-purpose models, showing that a general model with a sufficiently complex classification head can match or exceed domain-specific performance, and highlights the need for dynamic, regularly updated benchmarking datasets. Practical recommendations include reporting results over multiple seeds, leveraging early checkpoints to save compute, and prioritizing targeted finetuning of higher layers, while encouraging dynamic evaluation and broader language coverage in hate speech benchmarks.

Abstract

Despite their widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs' use in hate speech detection. We take a deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. Our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. We further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.

Paper Structure (16 sections, 9 figures, 13 tables)

Figures (9)

  • Figure 1: Research Overview: The study comprises five research questions (RQs) to empirically analyze the pretraining and finetuning strategies for PLM variants employed for hate detection. A typical PLM-inspired pipeline involves working with one or more checkpoints, i.e., PLM model weights obtained after pretraining. The checkpoint is then finetuned for downstream tasks by keeping one or more layers of PLM trainable along with a trainable classification head (CH). Finally, the PLM + CH generates predictions on incoming test samples.
  • Figure 2: RQ3: Macro F1 on different datasets finetuned with an MLP classifier on RoBERTa variants. The variants employed are from June 2019 ($R_{J19}$), October 2022 ($R_{O22}$), and December 2022 ($R_{D22}$). Each variant is trained on a corpus from Wikipedia and Common Crawl that was curated and updated before the date associated with the model. $R_{J19}$ is the original RoBERTa model, and $R_{O22}$ and $R_{D22}$ are its more recent variants.
  • Figure 3: RQ4: (a) Dynabench and (b) OLID -- Descriptive statistics of macro F1 when finetuning on top of individual layers of each BERT-variant, highlighting the layer ($L_i$) that, on average over MLP seeds ($ms$), leads to the minimum and maximum macro F1. Here, $L_i$ is trainable while all other layers are frozen. (c) Dynabench and (d) OLID -- Descriptive statistics of macro F1 when finetuning with one region of layers frozen and all others trainable (Suffix F), or with that region trainable and all others frozen (Suffix NF), for different BERT-variants, highlighting the region ($R_i$) that, on average over MLP seeds ($ms$), leads to the minimum and maximum macro F1.
  • Figure 4: RQ4: Percentage distribution of best and worst performing regions across datasets. The divisions on each bar show the percentage of datasets where the given configuration performs best (a) or worst (b) for a BERT-variant. "Combined" captures the overall trend across all BERT-variants and datasets. Region $R_1$ includes layers $L_1$ to $L_3$, $R_2$ from $L_4$ to $L_6$, $R_3$ from $L_7$ to $L_{9}$, and $R_4$ from $L_{10}$ to $L_{12}$. Suffix $F$ implies that the region was frozen while other regions were trainable, and the $NF$ suffix implies that all other regions were frozen while only that region was trainable.
  • Figure 5: RQ5: Macro F1 scores (averaged over MLP seeds $ms$) for (a) Dynabench and (b) OLID datasets employing BERT-variants (BERT, BERTweet, HateBERT, and mBERT). Classification heads of varying complexity (simple, medium, and complex) are utilized to capture their effect on BERT-variants employed for hate detection.
  • ...and 4 more figures
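
The region-wise freezing configurations described in the captions of Figures 3 and 4 can be sketched as a small lookup. This is only an illustration, not the authors' code: the region boundaries ($R_1$ to $R_4$ over layers $L_1$ to $L_{12}$) and the meaning of the $F$/$NF$ suffixes come from the Figure 4 caption, while the names `REGIONS` and `trainable_layers` are hypothetical. How the embedding layer is treated is not stated in the captions and is left out here.

```python
from __future__ import annotations

# Region boundaries as given in the Figure 4 caption (1-indexed encoder layers).
REGIONS = {
    "R1": range(1, 4),    # L1 to L3
    "R2": range(4, 7),    # L4 to L6
    "R3": range(7, 10),   # L7 to L9
    "R4": range(10, 13),  # L10 to L12
}
ALL_LAYERS = set(range(1, 13))


def trainable_layers(region: str, suffix: str) -> set[int]:
    """Return the encoder layers left trainable for one configuration.

    suffix "NF": only the named region is trainable, all others are frozen.
    suffix "F":  the named region is frozen, all others are trainable.
    (Hypothetical helper; the classification head is assumed always trainable.)
    """
    layers = set(REGIONS[region])
    if suffix == "NF":
        return layers
    if suffix == "F":
        return ALL_LAYERS - layers
    raise ValueError(f"unknown suffix: {suffix!r}")


if __name__ == "__main__":
    # Enumerate all eight region configurations (4 regions x 2 suffixes).
    for region in REGIONS:
        for suffix in ("F", "NF"):
            print(f"{region}-{suffix}:", sorted(trainable_layers(region, suffix)))
```

With a PyTorch encoder, the returned set would presumably be used to set `requires_grad` on the parameters of each layer, with layer $L_i$ mapped to index `i - 1` in the model's layer list; that mapping is an assumption, not something the captions specify.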