Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky

Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, Xiaowei Yang

TL;DR

This paper asks whether open-weight LLMs are ready for social media moderation. It evaluates seven state-of-the-art models (four proprietary, three open-weight) on real-world Bluesky posts, using Bluesky Moderation Service labels and human annotations as references. With zero-shot prompts and no fine-tuning, the open-weight LLMs perform in a range that overlaps with the proprietary models: sensitivity of roughly 81–97% and specificity of roughly 91–100%. Performance varies by harm category: specificity exceeds sensitivity for rudeness, while the opposite holds for intolerance and threats. The results support privacy-preserving moderation on consumer-grade hardware and offer insights for designing platform-scale and personalized moderation pipelines, while highlighting the need for clearer taxonomies and calibrated decision thresholds.

Abstract

As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be used effectively for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models in a zero-shot setting, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%–97%) and specificity (91%–100%) of the open-weight LLMs and those of the proprietary ones (72%–98% and 93%–99%, respectively). Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we report inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show that open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
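The two headline metrics can be made concrete with a small sketch. Sensitivity is the fraction of truly harmful posts that a model flags, and specificity is the fraction of benign posts that it leaves alone. The function name `sensitivity_specificity` and the toy labels below are illustrative assumptions, not the paper's code or data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary harmful/benign labels.

    y_true: reference labels (True = harmful), e.g. from human annotation.
    y_pred: model decisions (True = flagged as harmful).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          # harmful, flagged
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)      # harmful, missed
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # benign, passed
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)      # benign, flagged

    # Guard against an empty class; NaN signals "undefined" rather than 0.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: 4 harmful posts and 6 benign posts.
reference = [True, True, True, True, False, False, False, False, False, False]
model_out = [True, True, True, False, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(reference, model_out)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy case the model catches three of four harmful posts and passes five of six benign ones. A moderation pipeline that prioritizes catching harm would push for higher sensitivity, at the cost of more benign posts being flagged.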

Paper Structure (18 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: The CDF of labeler reaction latency
  • Figure 2: Inter-rater agreement matrices of the rude, intolerant, and threat tests.
  • Figure 3: Inter-rater agreement matrices of the three tests on posts labeled unanimously by human moderators.
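
An inter-rater agreement matrix like those in Figures 2 and 3 can be sketched as a table of pairwise agreement scores between raters (human moderators and LLMs). The summary above does not say which agreement statistic the paper uses, so Cohen's kappa here is an assumption, and the rater names and labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n  # each rater's rate of positive labels
    p_expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    if p_expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def agreement_matrix(
    ratings: Dict[str, Sequence[bool]]
) -> Dict[Tuple[str, str], float]:
    """Pairwise kappa for every unordered pair of raters."""
    return {
        (r1, r2): cohen_kappa(ratings[r1], ratings[r2])
        for r1, r2 in combinations(ratings, 2)
    }


# Hypothetical labels for eight posts (True = harmful).
ratings = {
    "human_1": [True, True, False, False, True, False, True, False],
    "human_2": [True, True, False, False, False, False, True, True],
    "model_x": [True, True, False, False, True, False, True, False],
}

matrix = agreement_matrix(ratings)
for pair, kappa in matrix.items():
    print(pair, round(kappa, 2))
```

Restricting the input to posts on which the human raters agree, as Figure 3 does, amounts to filtering the items before calling `agreement_matrix`.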