Chinese Offensive Language Detection: Current Status and Future Directions
Yunze Xiao, Houda Bouamor, Wajdi Zaghouani
TL;DR
This survey reviews benchmarks, datasets, and modeling approaches for detecting offensive language in Chinese, highlighting language-specific hurdles such as dialectal variation, cultural references, and rapidly evolving neologisms. It catalogs key resources such as COLD, TOXICN, SWSR, and sarcasm datasets, and reviews method families ranging from lexicon-based approaches to pretrained language models and multimodal methods, emphasizing cross-cultural transfer and user-context considerations. The paper identifies critical gaps in labeling reliability, cultural context coverage, and the handling of subversive expressions, and proposes a multi-pronged research agenda that includes context-aware models, richer datasets, and collaborative resource sharing. The findings aim to advance robust, culturally aware offensive language detection systems for Chinese, with practical implications for social platforms seeking real-time moderation across diverse linguistic communities.
Abstract
Despite considerable efforts to monitor and regulate user-generated content on social media platforms, the pervasiveness of offensive language, such as hate speech or cyberbullying, in the digital space remains a significant challenge. Given the importance of maintaining a civilized and respectful online environment, there is an urgent and growing need for automatic systems capable of detecting offensive speech in real time. However, developing effective systems for languages such as Chinese is difficult, owing to the language's complex and nuanced nature. This paper provides a comprehensive overview of offensive language detection in Chinese, examining current benchmarks and approaches and highlighting specific models and tools for addressing the unique challenges the language poses. The primary objective of this survey is to explore existing techniques and identify potential avenues for further research that can address the cultural and linguistic complexities of Chinese.
