Selective Demonstration Retrieval for Improved Implicit Hate Speech Detection

Yumin Kim, Hwanhee Lee

TL;DR

This paper tackles implicit hate speech detection by addressing context sensitivity and dataset subjectivity through Adaptive Retrieval-based In-context Learning for Implicit Hate Speech Detection (ARIIHA). The approach combines two retrieval-based in-context learning strategies (RetICL using BM25 and Target-prioritized RetICL) with adaptive thresholds to select demonstrations that emphasize either target-group similarity or lexical relevance, while mitigating shortcut cues. The authors optimize a similarity threshold and demonstrate, on the Implicit Hate Corpus, that ARIIHA outperforms baselines, including CoT-based methods, with notable reductions in over-sensitivity and robust few-shot performance. The work advances retrieval-based prompting for safety-critical NLP tasks and offers a practical solution that requires no model fine-tuning and improves reliability in hate speech detection across varying texts and demographics.

Abstract

Hate speech detection is a crucial area of research in natural language processing, essential for ensuring online community safety. However, detecting implicit hate speech, where harmful intent is conveyed in subtle or indirect ways, remains a major challenge. Unlike explicit hate speech, implicit expressions often depend on context, cultural subtleties, and hidden biases, making them more challenging to identify consistently. Additionally, the interpretation of such speech is influenced by external knowledge and demographic biases, resulting in varied detection results across different language models. Furthermore, Large Language Models often show heightened sensitivity to toxic language and references to vulnerable groups, which can lead to misclassifications. This over-sensitivity results in false positives (incorrectly identifying harmless statements as hateful) and false negatives (failing to detect genuinely harmful content). Addressing these issues requires methods that not only improve detection precision but also reduce model biases and enhance robustness. To address these challenges, we propose a novel method, which utilizes in-context learning without requiring model fine-tuning. By adaptively retrieving demonstrations that focus on similar groups or those with the highest similarity scores, our approach enhances contextual comprehension. Experimental results show that our method outperforms current state-of-the-art techniques. Implementation details and code are available at TBD.
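
The adaptive retrieval idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the switching rule (use plain BM25 ranking when the best score reaches the threshold, otherwise prioritize demonstrations with the same target group) is an assumption made for illustration, and the names `BM25`, `select_demonstrations`, `pool`, and the demonstration fields are hypothetical. The default threshold of 10 mirrors the optimum reported for Qwen2.5-7B-Instruct in Figure 2, but BM25 score scales depend on the implementation and corpus, so the value would need re-tuning.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use something stronger.
    return text.lower().split()


class BM25:
    """Small self-contained Okapi BM25 scorer (illustrative, not the paper's code)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        # The "1 +" inside the log keeps every IDF value positive.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}
        self.tf = [Counter(d) for d in self.docs]

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total


def select_demonstrations(query, query_target, pool, k=2, threshold=10.0):
    """Return (mode, demos). ASSUMED rule: lexical mode if the best BM25
    score reaches the threshold, otherwise target-prioritized mode."""
    bm25 = BM25([d["text"] for d in pool])
    scored = sorted(
        ((bm25.score(query, i), d) for i, d in enumerate(pool)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if scored[0][0] >= threshold:
        return "lexical", [d for _, d in scored[:k]]
    # Same-target demonstrations first (still ordered by BM25), then the rest.
    same = [d for _, d in scored if d["target"] == query_target]
    rest = [d for _, d in scored if d["target"] != query_target]
    return "target", (same + rest)[:k]


# Hypothetical demonstration pool; labels and targets are made up for the example.
pool = [
    {"text": "immigrants are taking all the jobs", "target": "immigrants", "label": "hate"},
    {"text": "the new immigrants opened a bakery downtown", "target": "immigrants", "label": "not_hate"},
    {"text": "women cannot handle jobs like these", "target": "women", "label": "hate"},
    {"text": "the weather is lovely today", "target": "none", "label": "not_hate"},
]
query = "they take jobs like these from us"

mode, demos = select_demonstrations(query, "immigrants", pool, k=2)
```

With this toy pool the scores stay well below 10, so the call falls back to target-prioritized selection. Passing `threshold=0.0` forces the purely lexical ranking, which makes it easy to compare the two modes on the same query.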

Paper Structure

This paper contains 11 sections, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Overall pipeline of our proposed ARIIHA approach. Despite the presence of sensitive words in the input text, colored in red, ARIIHA accurately detects the correct label while mitigating over-sensitivity.
  • Figure 2: Performance graph of Qwen2.5-7B-Instruct across varying BM25 similarity score thresholds, with an optimal threshold identified at 10.