ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection
Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu
TL;DR
This work introduces ToxiFrench, the largest native French toxicity dataset (53k+ comments), built with a semi-automated pipeline that combines pre-annotation with human verification to keep labels high-quality while minimizing manual effort. Benchmarking shows that small language models can rival larger models in robustness for toxicity detection. The work proposes a Chain-of-Thought fine-tuning framework, augmented by a Dynamic Weighted Loss (DWL) and Direct Preference Optimization (DPO), to improve faithfulness and performance. The resulting Qwen3-4B model achieves state-of-the-art performance on ToxiFrench, surpassing larger models such as GPT-4o on this benchmark and showing notable cross-lingual generalization. The methodology offers a scalable, culturally grounded approach to toxicity detection with practical implications for multilingual safety systems, while acknowledging dataset-specific biases and ethical considerations.
Abstract
Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, human-annotated, large-scale datasets. In this work, we release ToxiFrench, a dataset of 53,622 French online comments together with a balanced benchmark split for systematic evaluation. The dataset is constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification, while ensuring statistical alignment with human-only annotation. We then benchmark a broad range of models and uncover a counterintuitive finding: Small Language Models (SLMs) often surpass larger models in robustness and generalization on this task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a Dynamic Weighted Loss (DWL) that progressively emphasizes the model's final decision and significantly improves faithfulness. Our fine-tuned 4B model (Qwen3-4B) achieves state-of-the-art performance on the benchmark. It improves its balanced accuracy by 10% over its baseline and achieves better performance than GPT-4o and DeepSeek-R1 on our benchmark, while successfully retaining cross-lingual capabilities.
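For readers unfamiliar with the balanced accuracy metric reported above, the following is a minimal sketch of its standard definition (the mean of per-class recall). The function name and the toy labels are illustrative assumptions and are not taken from the ToxiFrench benchmark or its evaluation code.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    totals = Counter(y_true)  # number of examples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # hits per class
    return sum(correct[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels (1 = toxic, 0 = non-toxic) on a balanced split.
balanced_true = [1, 1, 1, 1, 0, 0, 0, 0]
balanced_pred = [1, 1, 1, 0, 0, 0, 1, 1]
balanced_score = balanced_accuracy(balanced_true, balanced_pred)

# Hypothetical imbalanced split where the model always predicts "non-toxic".
skewed_true = [1, 1, 0, 0, 0, 0, 0, 0]
skewed_pred = [0, 0, 0, 0, 0, 0, 0, 0]
skewed_score = balanced_accuracy(skewed_true, skewed_pred)
skewed_plain_accuracy = sum(t == p for t, p in zip(skewed_true, skewed_pred)) / len(skewed_true)
```

On a perfectly balanced split, balanced accuracy coincides with plain accuracy. On the skewed toy example, plain accuracy rewards the trivial "always non-toxic" predictor, while balanced accuracy does not.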
