ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality
Yu-Xiang Luo, Yi-Cheng Lin, Ming-To Chuang, Jia-Hung Chen, I-Ning Tsai, Pei Xing Kiew, Yueh-Hsuan Huang, Chien-Feng Liu, Yu-Chen Chen, Bo-Han Feng, Wenze Ren, Hung-yi Lee
TL;DR
This work addresses the gap in Mandarin spoken-toxicity research by introducing ToxicTone, the largest public dataset of its kind, with detailed annotations for both the form of toxicity (e.g., profanity, bullying) and its source (e.g., anger, sarcasm, dismissiveness). It presents a multimodal detection framework that combines acoustic, linguistic, and emotional encodings and outperforms text-only baselines. The study shows that integrating speech-specific cues improves both toxicity detection and toxicity-source classification, and it reports domain trends such as a higher prevalence of toxicity in gaming content. By providing a Mandarin toxicity benchmark and demonstrating effective multimodal fusion, the paper lays a foundation for safer online speech and further work on spoken-language toxicity analysis.
Abstract
Despite extensive research on text-based toxicity detection, a critical gap remains in handling spoken Mandarin audio. The lack of annotated datasets that capture the unique prosodic cues and culturally specific expressions in Mandarin leaves spoken toxicity underexplored. To address this, we introduce ToxicTone, the largest public dataset of its kind, featuring detailed annotations that distinguish both forms of toxicity (e.g., profanity, bullying) and sources of toxicity (e.g., anger, sarcasm, dismissiveness). Our data, sourced from diverse real-world audio and organized into 13 topical categories, mirrors authentic communication scenarios. We also propose a multimodal detection framework that integrates acoustic, linguistic, and emotional features using state-of-the-art speech and emotion encoders. Extensive experiments show that our approach outperforms text-only and baseline models, underscoring the essential role of speech-specific cues in revealing hidden toxic expressions.
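As a rough illustration of what integrating acoustic, linguistic, and emotional features can look like, the sketch below concatenates one embedding per modality and scores the result with a single logistic layer. This is only an assumed, simplified form of feature-level fusion, not the framework proposed in the paper: the function names `fuse_features` and `toxicity_probability`, the toy vectors, and the weights are all hypothetical, and real systems would obtain the embeddings from pretrained speech, text, and emotion encoders and learn the weights from data.

```python
import math
from typing import List, Sequence


def fuse_features(
    acoustic: Sequence[float],
    linguistic: Sequence[float],
    emotional: Sequence[float],
) -> List[float]:
    """Concatenate per-utterance embeddings from the three modalities."""
    return [*acoustic, *linguistic, *emotional]


def toxicity_probability(
    fused: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Score a fused vector with one linear layer followed by a sigmoid."""
    if len(fused) != len(weights):
        raise ValueError("fused vector and weights must have the same length")
    logit = sum(x * w for x, w in zip(fused, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy, made-up embeddings standing in for encoder outputs (hypothetical values).
acoustic = [0.2, -0.1]    # e.g. pooled output of a speech encoder
linguistic = [0.7, 0.3]   # e.g. pooled output of a text encoder on the transcript
emotional = [0.9]         # e.g. output of an emotion encoder

# Hypothetical weights; in practice these would be trained on labeled data.
weights = [0.5, -0.5, 1.0, 0.2, 1.5]
bias = -1.0

fused = fuse_features(acoustic, linguistic, emotional)
prob = toxicity_probability(fused, weights, bias)
print(f"fused dimension: {len(fused)}, toxicity probability: {prob:.3f}")
```

Concatenation is the simplest fusion choice and keeps each modality's contribution visible in the weights; whether the paper uses concatenation, attention, or another fusion scheme is not stated in the abstract.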
