Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection

Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu

TL;DR

This work examines how well large language models detect implicit hate speech and estimate their own uncertainty. It jointly evaluates primary classification and calibration across three uncertainty-estimation methods (verbal-based, consistency-based, and logit-based) and multiple prompt patterns, on three datasets with three models. The authors report two key limitations: (i) excessive sensitivity to groups and topics associated with fairness concerns, which causes benign statements to be misclassified as hate speech, and (ii) confidence scores that concentrate in a narrow range regardless of dataset difficulty, so that calibration depends heavily on classification accuracy. They also find that prompt patterns affect performance but no single pattern uniformly dominates, which suggests a need for cautious model optimization and possibly ensemble approaches for more reliable fairness outcomes.

Abstract

The fairness and trustworthiness of Large Language Models (LLMs) are receiving increasing attention. Implicit hate speech, which employs indirect language to convey hateful intentions, occupies a significant portion of practice. However, the extent to which LLMs effectively address this issue remains insufficiently examined. This paper delves into the capability of LLMs to detect implicit hate speech (Classification Task) and express confidence in their responses (Calibration Task). Our evaluation meticulously considers various prompt patterns and mainstream uncertainty estimation methods. Our findings highlight that LLMs exhibit two extremes: (1) LLMs display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in misclassifying benign statements as hate speech. (2) LLMs' confidence scores for each method excessively concentrate on a fixed range, remaining unchanged regardless of the dataset's complexity. Consequently, the calibration performance is heavily reliant on primary classification accuracy. These discoveries unveil new limitations of LLMs, underscoring the need for caution when optimizing models to ensure they do not veer towards extremes. This serves as a reminder to carefully consider sensitivity and confidence in the pursuit of model fairness.

Paper Structure (30 sections, 2 equations, 11 figures, 6 tables)

This paper contains 30 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Precision and recall of different LLMs (distinguished by colors) with various prompt patterns (distinguished by shapes) in hate speech detection. The recall is significantly higher than the precision for LLaMA-2-7b and Mixtral-8x7b on both the Latent Hatred and SBIC datasets, indicating that LLMs may misjudge benign expressions as hate speech. This over-sensitivity arises from the presence of sensitive groups and topics within benign expressions.
  • Figure 2: The best-performing uncertainty estimation method in different scenarios categorized by the model's output token logit and primary classification performance. Logit-based confidence scores achieve the best AUC in all scenarios, while the ECE for each method varies across scenarios.
  • Figure 3: The comparison of the ROC curve.
  • Figure 4: The figure showcases the relationship between AUC and the ensemble number.
  • Figure 5: The ECE performance of LLaMA-2-7b on the SBIC dataset shows that the verbal-based confidence is mainly concentrated in the 70%-80% range, around the overall accuracy of 77%, thus achieving the best ECE.
  • ...and 6 more figures
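
The Figure 5 caption ties a good ECE to confidence scores that happen to sit near the overall accuracy. The sketch below illustrates that effect with a standard equal-width-bin Expected Calibration Error. It is not the authors' evaluation code: the function name, the 10-bin setting, and the synthetic predictions are assumptions made for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: 0/1 flags of the same length.
    Hypothetical helper, not taken from the paper.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for c, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += hit
        count[idx] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


# Synthetic data (assumed): every prediction reports a confidence of 0.75.
flat_conf = [0.75] * 100
easy_set = [1] * 77 + [0] * 23   # 77% accuracy, close to the confidence
hard_set = [1] * 55 + [0] * 45   # 55% accuracy, same confidence scores

ece_easy = expected_calibration_error(flat_conf, easy_set)  # about 0.02
ece_hard = expected_calibration_error(flat_conf, hard_set)  # about 0.20
print(f"ECE at 77% accuracy: {ece_easy:.3f}")
print(f"ECE at 55% accuracy: {ece_hard:.3f}")
```

With identical confidence scores, ECE changes only because accuracy changes, which is the sense in which concentrated confidence makes calibration depend on classification performance.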