Probing Association Biases in LLM Moderation Over-Sensitivity

Yuxin Wang, Botao Yu, Ivory Yang, Saeed Hassanpour, Soroush Vosoughi

Abstract

Large Language Models are widely used for content moderation but often exhibit over-sensitivity, misclassifying benign content and rejecting safe user commands. While previous research attributes this issue primarily to the presence of explicit offensive triggers, we show statistically that the connection runs deeper than the token level: when behaving over-sensitively, particularly on decontextualized statements, LLMs exhibit systematic topic-toxicity association patterns that go beyond explicit offensive triggers. To characterize these patterns, we propose Topic Association Analysis, a behavior-based probe that elicits short contextual scenarios for benign inputs and quantifies topic amplification between the scenario and the original comment. Across multiple LLMs and large-scale data, we find that more advanced models (e.g., GPT-4 Turbo) show stronger topic-association skew in false-positive cases despite lower overall false-positive rates. Moreover, via controlled prefix interventions, we show that topic cues can measurably shift false-positive rates, indicating that topic framing is decision-relevant. These results suggest that mitigating over-sensitivity may require addressing learned topic associations in addition to keyword-based filtering.

Paper Structure

This paper contains 34 sections, 4 equations, 24 figures, 8 tables.

Figures (24)

  • Figure 1: Illustration of how scenario elicitation can surface systematic differences in topic associations between non-toxic vs. toxic judgments for the same decontextualized comment. We treat the elicited scenario as a behavioral probe rather than a faithful rationale.
  • Figure 2: Number of false positives predicted by GPT-4o for two groups of benign comments (w/ or w/o offensive terms) under two experimental settings: Binary and Rating (threshold $=4$). OT is short for offensive terms. The label on each column is the false positive rate (FPR) of the corresponding category.
  • Figure 3: The FPRs of different LLMs on benign comments in Civil Comments w/ and w/o offensive terms.
  • Figure 4: Vagueness in GPT's self-explanations of its false-positive judgments under the example-provided prompt. GPT tends to output generic, template-like language.
  • Figure 5: Our framework for analyzing Topic Association Biases.
  • ...and 19 more figures
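
The captions for Figures 2 and 3 report false positive rates (FPRs) on benign comments, split by whether a comment contains offensive terms. As a rough illustration of that metric only, the sketch below computes per-group FPR as the fraction of benign comments a model flags as toxic. The function name `fpr_by_group`, the group labels, the toy records, and the rule that a rating at or above the threshold counts as flagged are all assumptions made for this example; the paper's exact thresholding rule and data format are not given here.

```python
from collections import Counter


def fpr_by_group(benign_records, rating_threshold=None):
    """Per-group false positive rate over benign comments.

    benign_records: iterable of (group, output) pairs, where every comment
        is benign by ground truth. `output` is a bool in the Binary setting
        or a numeric score in the Rating setting.
    rating_threshold: if given, a comment counts as flagged when
        output >= rating_threshold (assumed rule, not taken from the paper).
    """
    totals, flagged = Counter(), Counter()
    for group, output in benign_records:
        totals[group] += 1
        if rating_threshold is None:
            is_flagged = bool(output)            # Binary setting
        else:
            is_flagged = output >= rating_threshold  # Rating setting
        flagged[group] += int(is_flagged)
    # Every record is benign, so each flagged comment is a false positive.
    return {group: flagged[group] / totals[group] for group in totals}


# Hypothetical toy data, not results from the paper.
binary_records = [
    ("with_OT", True), ("with_OT", False),
    ("without_OT", False), ("without_OT", False),
    ("without_OT", True), ("without_OT", False),
]
rating_records = [
    ("with_OT", 5), ("with_OT", 3),
    ("without_OT", 4), ("without_OT", 1),
]

binary_fpr = fpr_by_group(binary_records)
rating_fpr = fpr_by_group(rating_records, rating_threshold=4)
print(binary_fpr)
print(rating_fpr)
```

Because all inputs are benign, the denominator is simply the group size, which matches the standard definition FPR = FP / (FP + TN).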