ToxiLab: How Well Do Open-Source LLMs Generate Synthetic Toxicity Data?
Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang
TL;DR
This paper examines whether open-source LLMs can effectively generate synthetic toxic data to augment hate-speech detection datasets. It proposes a two-stage approach (prompt engineering followed by supervised fine-tuning with LoRA) to overcome safety alignment and improve data quality, diversity, and realism. Across six open-source models and five datasets, the study finds that fine-tuned models such as Mistral can narrow the gap to GPT-4 and in some cases approach its performance. Data-mixing strategies boost generalization but risk overfitting or duplication. The results highlight both the promise and the practical challenges of deploying synthetic toxic data at scale for real-world content moderation, informing responsible use and future improvements in robustness and fairness.
Abstract
Effective toxic content detection relies heavily on high-quality and diverse data, which serve as the foundation for robust content moderation models. Synthetic data has become a common approach for training models across various NLP tasks. However, its effectiveness remains uncertain for highly subjective tasks like hate speech detection, with previous research yielding mixed results. This study explores the potential of open-source LLMs for harmful data synthesis, utilizing controlled prompting and supervised fine-tuning techniques to enhance data quality and diversity. We systematically evaluate six open-source LLMs on five datasets, assessing their ability to generate diverse, high-quality harmful data while minimizing hallucination and duplication. Our results show that Mistral consistently outperforms other open models, and supervised fine-tuning significantly enhances data reliability and diversity. We further analyze the trade-offs between prompt-based and fine-tuned toxic data synthesis, discuss real-world deployment challenges, and highlight ethical considerations. Our findings demonstrate that fine-tuned open-source LLMs provide scalable and cost-effective solutions to augment toxic content detection datasets, paving the way for more accessible and transparent content moderation tools.
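
The abstract refers to assessing generated data for duplication and diversity. The sketch below shows one simple way such properties could be measured. It is an illustration only, not the paper's evaluation protocol: the function names `duplication_rate` and `distinct_n`, the whitespace tokenization, and the placeholder samples are all assumptions introduced here.

```python
def duplication_rate(samples):
    """Fraction of samples that repeat an earlier sample.

    Samples are compared after lowercasing and collapsing whitespace
    (an assumed normalization, not one taken from the paper).
    """
    if not samples:
        return 0.0
    normalized = [" ".join(s.lower().split()) for s in samples]
    return 1.0 - len(set(normalized)) / len(normalized)


def distinct_n(samples, n=2):
    """Ratio of unique n-grams to total n-grams across all samples.

    Higher values suggest more lexical diversity. Tokens are obtained
    by whitespace splitting (an assumption made for simplicity).
    """
    total = 0
    unique = set()
    for s in samples:
        tokens = s.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Hypothetical, benign placeholder outputs standing in for generated text.
generated = [
    "sample output one",
    "Sample  output one",
    "another different sample",
]
print(f"duplication rate: {duplication_rate(generated):.3f}")
print(f"distinct-2: {distinct_n(generated, n=2):.3f}")
```

Exact-match duplication and distinct-n are deliberately coarse. A fuller analysis would likely also consider near-duplicates and semantic similarity, which this sketch does not attempt.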
