Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection
Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu
TL;DR
This work examines how well large language models detect implicit hate speech and estimate their own uncertainty. It jointly evaluates primary classification and calibration across three uncertainty-estimation methods (verbal-based, consistency-based, and logit-based) and multiple prompt patterns, using three datasets and three models. The authors reveal two key limitations: (i) excessive sensitivity to groups or topics related to fairness, which causes benign statements to be misclassified as hate speech, and (ii) highly concentrated confidence scores that disregard dataset difficulty, so that calibration quality depends heavily on classification accuracy. They also find that prompt patterns affect performance but that no pattern uniformly dominates, suggesting a need for cautious model optimization and potentially for ensemble approaches to obtain more reliable fairness outcomes.
Abstract
The fairness and trustworthiness of Large Language Models (LLMs) are receiving increasing attention. Implicit hate speech, which uses indirect language to convey hateful intentions, accounts for a significant portion of hate speech in practice. However, the extent to which LLMs effectively address this issue remains insufficiently examined. This paper examines the capability of LLMs to detect implicit hate speech (Classification Task) and to express confidence in their responses (Calibration Task). Our evaluation considers various prompt patterns and mainstream uncertainty estimation methods. Our findings show that LLMs exhibit two extremes: (1) LLMs display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in benign statements being misclassified as hate speech. (2) LLMs' confidence scores for each method concentrate excessively on a fixed range and remain unchanged regardless of the dataset's complexity. Consequently, calibration performance relies heavily on primary classification accuracy. These findings reveal new limitations of LLMs and underscore the need for caution when optimizing models, so that they do not veer towards extremes. They serve as a reminder to carefully consider sensitivity and confidence in the pursuit of model fairness.
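To make the second finding concrete, the following is a minimal sketch of why calibration tracks accuracy when confidence scores sit in a fixed range. It assumes Expected Calibration Error (ECE) with equal-width bins as the calibration metric; the abstract does not name the metric used, and the function names and synthetic data here are hypothetical illustrations, not the authors' evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    ECE is an assumed metric for this sketch; the abstract does not specify one.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def ece_with_fixed_confidence(accuracy, n=100, conf=0.9):
    """Synthetic (hypothetical) case: every prediction reports the same confidence."""
    n_correct = round(accuracy * n)
    correct = [1] * n_correct + [0] * (n - n_correct)
    confidences = [conf] * n
    return expected_calibration_error(confidences, correct)


# Same confidence distribution, two datasets of different difficulty.
ece_easy = ece_with_fixed_confidence(accuracy=0.9)  # accuracy matches confidence
ece_hard = ece_with_fixed_confidence(accuracy=0.6)  # accuracy falls below confidence
print(f"easy dataset ECE: {ece_easy:.3f}")
print(f"hard dataset ECE: {ece_hard:.3f}")
```

In this toy setting all predictions land in one bin, so ECE reduces to the gap between accuracy and the fixed confidence level. The confidence scores carry no information about difficulty, and the calibration error is determined entirely by how accurate the classifier happens to be on each dataset.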
