Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning

Huan Ma, Changqing Zhang, Huazhu Fu, Peilin Zhao, Bingzhe Wu

TL;DR

This work investigates privately deployable content moderation using large language models, arguing that simple discriminative fine-tuning can overfit on limited data. It advocates a Chain-of-Thought–inspired fine-tuning pipeline with weak supervision, data deduplication, and data-quality checks to improve robustness and interpretability. Through experiments on a Chinese six-category moderation task, the authors show that generative LLM fine-tuning with reasoning can outperform strong baselines, with notable gains when applying recheck and data-cleaning strategies, and demonstrate cross-lingual and zero-shot generalization. The results provide practical guidance for private deployment of domain-specific moderation systems, including data construction, training strategies (LoRA-based PEFT), and evaluation across in-distribution and OOD settings. Overall, the paper contributes actionable insights into making privately deployed LLMs effective for content moderation while mitigating overfitting and hallucination risks during fine-tuning.

Abstract

Every day, billions of people communicate and express their opinions on the internet. Unfortunately, not all of these expressions are friendly or compliant, making content moderation an indispensable task. A common approach is to use a discriminative model to classify the content, but this method often requires strict data engineering; without it, the model overfits to an unacceptable degree. With the successful development of Large Language Models (LLMs) in recent years, LLM-based methods have become a feasible solution for tasks in various domains. Thanks to the knowledge in foundation models, we can develop more robust privately deployed models from limited data by fine-tuning them. Moreover, because an LLM is a generative model, it can provide a detailed analysis of the review process, enhancing interpretability. In this paper, we describe how to fine-tune an LLM that can be privately deployed for content moderation. Specifically, we discuss the differences between discriminative and generative models, using content moderation as an example. Additionally, we show that incorporating reasoning processes during the fine-tuning of LLMs can effectively alleviate overfitting, even if the model is not allowed to output reasoning processes directly during deployment. We present a complete process for fine-tuning LLMs in vertical-domain deployments, from data collection and construction to model training and overfitting elimination. We report the entire research process and the key findings, hoping to provide valuable experience for researchers who are fine-tuning privately deployed models for their own domain-specific work.

Paper Structure (19 sections, 2 figures, 7 tables)

This paper contains 19 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Content Moderation with Auditing Processes. GPT-4 can provide complete auditing processes (left), but sometimes it presents limitations (right).
  • Figure 2: F1 Score on OOD datasets.