Evaluating Large Language Models for Detecting Antisemitism
Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
TL;DR
The paper systematically evaluates eight open-source LLMs for antisemitism detection using in-context IHRA-based guidelines and introduces Guided-CoT prompting to align model reasoning with moderation policies. It shows that Guided-CoT improves F1 scores and reduces refusals across models and decoding strategies, with Llama 3.1 70B outperforming a fine-tuned GPT-3.5. The work also introduces quantitative metrics for comparing LLM explanations and measuring semantic divergence (SCMD) across models, uncovering paradoxical and model-dependent behaviors that inform reliability and interpretability audits. While offering practical insights for policy-guided moderation, the study acknowledges that its dataset is limited in scope (Twitter, English) and calls for broader, multilingual, and platform-diverse evaluations to establish robust, real-world applicability.
Abstract
Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging an in-context definition. We also study how LLMs understand and explain their decisions given a moderation policy as a guideline. First, we explore various prompting techniques and design a new CoT-like prompt, Guided-CoT, and find that injecting domain-specific thoughts increases performance and utility. Guided-CoT handles the in-context policy well, improving performance and, by reducing refusals, utility across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight differences across LLMs in utility, explainability, and reliability. Code and resources are available at: https://github.com/idramalab/quantify-llm-explanations
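To make the two headline quantities concrete, the following is a minimal sketch of how an F1 score and a refusal rate could be computed from binary labels when a model sometimes declines to answer. It is not the authors' evaluation code. The function name `f1_and_refusal_rate`, the use of `None` to mark a refusal, and the choice to count a refusal on an antisemitic example as a false negative are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def f1_and_refusal_rate(
    gold: Sequence[int], pred: Sequence[Optional[int]]
) -> tuple[float, float]:
    """Return (F1 for the positive class, fraction of refused examples).

    gold: 1 = antisemitic, 0 = not antisemitic.
    pred: same labels, or None when the model refused to answer (assumed encoding).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = refusals = 0
    for g, p in zip(gold, pred):
        if p is None:
            refusals += 1
            # Assumption: a refusal on a positive example counts as a miss.
            if g == 1:
                fn += 1
            continue
        if p == 1 and g == 1:
            tp += 1
        elif p == 1 and g == 0:
            fp += 1
        elif p == 0 and g == 1:
            fn += 1

    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    refusal_rate = refusals / len(gold) if len(gold) else 0.0
    return f1, refusal_rate


# Hypothetical toy data: five posts, one refusal.
gold_labels = [1, 1, 0, 0, 1]
model_output = [1, None, 0, 1, 0]
f1, refusal_rate = f1_and_refusal_rate(gold_labels, model_output)
print(f"F1={f1:.2f}, refusal rate={refusal_rate:.2f}")
```

Under this convention, a prompt that reduces refusals can raise F1 simply by converting unanswered positive examples into answered ones, which is one reason to report the two numbers side by side.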
