SLM-Mod: Small Language Models Surpass LLMs at Content Moderation

Xianyang Zhan, Agam Goyal, Yilun Chen, Eshwar Chandrasekharan, Koustuv Saha

TL;DR

This work investigates whether open-source small language models (SLMs) fine-tuned with LoRA can outperform larger closed- and open-source LLMs at community-specific content moderation. Using a large Reddit-based dataset, the authors show that SLMs achieve higher accuracy and recall on in-domain moderation and remain strong under imbalanced data, while LLMs offer higher precision. They further show that SLMs transfer reasonably well across domains, sometimes outperforming LLMs in cross-domain settings, and report stability advantages over closed-source models. The study argues for a shift toward specialist SLMs for scalable, cost-efficient moderation and outlines hybrid deployment strategies that combine automated triage with human review. It also discusses limitations and directions for future work, including continual updating to track evolving community norms and extending the approach to multimodal moderation.

Abstract

Large language models (LLMs) have shown promise in many natural language understanding tasks, including content moderation. However, these models can be expensive to query in real time and do not allow for a community-specific approach to content moderation. To address these challenges, we explore the use of open-source small language models (SLMs) for community-specific content moderation tasks. We fine-tune and evaluate SLMs (fewer than 15B parameters) by comparing their performance against much larger open- and closed-source models in both zero-shot and few-shot settings. Using 150K comments from 15 popular Reddit communities, we find that SLMs outperform zero-shot LLMs at content moderation -- 11.5% higher accuracy and 25.7% higher recall on average across all communities. Moreover, few-shot in-context learning yields only a marginal increase in the performance of LLMs, which still lag behind SLMs. We further show the promise of cross-community content moderation, which has implications for new communities and the development of cross-platform moderation techniques. Finally, we outline directions for future work on language-model-based content moderation. Code and models can be found at https://github.com/AGoyal0512/SLM-Mod.
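
The accuracy, recall, and precision comparisons above can be made concrete with a small sketch. The snippet below is illustrative only: it assumes a binary labeling (1 = comment should be removed, 0 = keep), and the labels and predictions are toy values invented for this example, not data or results from the paper. The names `Confusion`, `confusion`, `accuracy`, `precision`, and `recall` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # removed comments correctly flagged
    fp: int  # kept comments wrongly flagged (over-moderation)
    tn: int  # kept comments correctly left alone
    fn: int  # removed comments that were missed


def confusion(y_true, y_pred):
    """Count outcomes, assuming label 1 = remove and label 0 = keep."""
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
    )


def accuracy(c):
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c):
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c):
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


if __name__ == "__main__":
    # Toy labels: four comments that moderators removed, four they kept.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    # A recall-leaning classifier flags every violation plus one benign comment.
    recall_leaning = [1, 1, 1, 1, 1, 0, 0, 0]
    # A precision-leaning classifier flags only two comments, both correctly.
    precision_leaning = [1, 1, 0, 0, 0, 0, 0, 0]

    for name, y_pred in [("recall-leaning", recall_leaning),
                         ("precision-leaning", precision_leaning)]:
        c = confusion(y_true, y_pred)
        print(name, accuracy(c), precision(c), recall(c))
```

In this toy setup the recall-leaning classifier catches every violation at the cost of one false positive, while the precision-leaning one never over-moderates but misses half the violations. This mirrors the kind of trade-off the summary describes between SLMs and LLMs, without reproducing any of its numbers.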

Paper Structure

This paper contains 40 sections, 1 equation, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Online Moderation with Language Models. Given a comment from the subreddit r/changemyview, the preceding context, and the community rules, we compare LLMs and SLMs on moderation performance.
  • Figure 2: In-domain Moderation Performance. Accuracy, recall, and precision of SLMs versus LLMs on in-domain content moderation. The best-performing SLMs outperform LLMs on accuracy and recall across all subreddits, while LLMs outperform SLMs on precision.
  • Figure 3: Impact of content length. The probabilities of mistakes (FP and FN) made by SLMs and LLMs across comment lengths (in words) in r/changemyview reveal that SLMs tend to over-moderate shorter comments, whereas LLMs are more forgiving of them. The vertical bar indicates the median comment length in r/changemyview, 19 words.
  • Figure 4: Imbalanced Distribution Evaluation. AUC scores of the best-performing SLM (Mistral-NeMo-Instruct) and LLM (GPT-4o) on test splits of r/changemyview and r/AskReddit at $1\%$, $5\%$, and $10\%$ imbalance levels. Error bars represent the standard deviation over 30 seeds.
  • Figure 5: Cross-domain Moderation Performance. Accuracy of SLMs on cross-domain content moderation for three target subreddits: r/IAmA, r/askscience, and r/movies. Mistral-NeMo-Instruct gives the best cross-domain performance: 75% accuracy on r/IAmA with the model fine-tuned for r/politics, 65% accuracy on r/askscience with the models fine-tuned for r/science and r/history, and 80% accuracy on r/movies with the models fine-tuned for r/aww and r/nba.
  • ...and 3 more figures
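
The imbalanced evaluation described for Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes the imbalance level is the fraction of removed (positive) comments in each test split, it computes AUC with the standard rank-based formulation (the probability that a random positive outscores a random negative), and the model scores are synthetic values generated for this example. The function names `imbalanced_sample`, `auc`, and `auc_over_seeds` are hypothetical.

```python
import random
import statistics


def imbalanced_sample(removed, kept, positive_rate, size, seed):
    """Draw a split in which roughly `positive_rate` of items are removed comments."""
    rng = random.Random(seed)
    n_pos = max(1, round(size * positive_rate))
    n_neg = size - n_pos
    return rng.sample(removed, n_pos), rng.sample(kept, n_neg)


def auc(pos_scores, neg_scores):
    """Rank-based ROC AUC: P(positive score > negative score), ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_over_seeds(removed, kept, positive_rate, size=200, n_seeds=30):
    """Mean and sample standard deviation of AUC over resampled splits."""
    scores = []
    for seed in range(n_seeds):
        pos, neg = imbalanced_sample(removed, kept, positive_rate, size, seed)
        scores.append(auc(pos, neg))
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Synthetic "probability of removal" scores standing in for model outputs.
    rng = random.Random(0)
    removed = [rng.gauss(0.7, 0.15) for _ in range(500)]
    kept = [rng.gauss(0.3, 0.15) for _ in range(2000)]

    for rate in (0.01, 0.05, 0.10):
        mean, std = auc_over_seeds(removed, kept, rate)
        print(f"{rate:.0%} removed: AUC = {mean:.3f} +/- {std:.3f}")
```

Resampling the split under several seeds is what makes error bars possible: at a 1% positive rate each split contains only a handful of removed comments, so the AUC estimate varies more from seed to seed than it does at 10%.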