Research on Violent Text Detection System Based on BERT-fasttext Model
Yongsheng Yang, Xiaoying Wang
TL;DR
The paper tackles violent text detection in online environments by proposing a BERT-fasttext fusion that combines BERT's contextual language understanding with FastText's efficient text classification. It introduces a keyword extraction component using a $\chi^{2}$-FPN algorithm, and a hybrid rule-language model that leverages $n$-gram context to constrain rules. Feature selection relies on multiple statistical criteria, including MI, IG, and $\chi^{2}$, to improve discriminative power. Experimental results on a hate speech dataset show that the BERT-fasttext model achieves top performance (e.g., Acc≈87.6%, F1≈86.6%), outperforming individual baselines and suggesting practical benefits for scalable content moderation and the development of domain-specific Chinese cyber-violence corpora.
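As a rough illustration of the $\chi^{2}$ criterion mentioned above, the sketch below scores each term by its association with a target class using the standard 2x2 contingency-table statistic. It is plain $\chi^{2}$ term scoring only, not the paper's $\chi^{2}$-FPN algorithm, whose details are not given here. The function name `chi_square_scores` and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def chi_square_scores(docs, labels, target_label):
    """Score each term by its chi-square association with target_label.

    docs   : list of token lists (one list per document)
    labels : list of class labels, aligned with docs
    Uses document frequencies in a 2x2 table per term:
        a = target-class docs containing the term
        b = other-class docs containing the term
        c = target-class docs without the term
        d = other-class docs without the term
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target_label)
    n_neg = n - n_pos

    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        bucket = pos_df if y == target_label else neg_df
        bucket.update(set(tokens))  # count each term once per document

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        c, d = n_pos - a, n_neg - b
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the table is degenerate; score it as 0.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


# Toy example (hypothetical data): label 1 = violent, 0 = non-violent.
docs = [["kill", "you"], ["hello", "you"], ["kill", "them"], ["nice", "day"]]
labels = [1, 0, 1, 0]
scores = chi_square_scores(docs, labels, target_label=1)
top_terms = sorted(scores, key=scores.get, reverse=True)
```

Terms that appear equally often in both classes (here, "you") receive a score of zero, while terms concentrated in one class rank highest and are natural keyword candidates.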
Abstract
The internet has become an indispensable platform for daily life, work, and information exchange. The proliferation of violent text online, however, has brought many negative effects, which makes an effective system for intercepting such text particularly important. This study builds a violent text interception system on a BERT-fasttext model. BERT is a pre-trained language model with strong natural language understanding that can mine and analyze the semantic information in text; FastText is an efficient, low-complexity text classifier that can quickly provide a baseline judgment. Combining the two allows the system both to identify violent text accurately and to intercept it efficiently, preventing harmful information from spreading freely on the network. Compared with BERT alone and FastText alone, the combined model improves accuracy by 0.7% and 0.8%, respectively. The model can help purify the network environment, keep online information healthy, and create a positive, civilized, and harmonious communication space for netizens, steering social networking and information dissemination in a healthier direction.
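The abstract does not specify how the two models are combined. One simple possibility, shown here purely as an assumption, is late fusion: each model outputs class probabilities, and the system takes a weighted average before choosing a label. The function names, the `weight` parameter, and the example probabilities are all hypothetical and are not taken from the paper.

```python
def fuse_probabilities(p_bert, p_fasttext, weight=0.5):
    """Weighted average of two class-probability vectors (late fusion).

    weight is the share given to the BERT scores; (1 - weight) goes to
    the FastText scores. This is an illustrative scheme, not the
    paper's documented fusion method.
    """
    if len(p_bert) != len(p_fasttext):
        raise ValueError("probability vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(p_bert, p_fasttext)]


def predict(p_bert, p_fasttext, weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_probabilities(p_bert, p_fasttext, weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for classes [non-violent, violent].
fused = fuse_probabilities([0.9, 0.1], [0.4, 0.6], weight=0.5)
label = predict([0.4, 0.6], [0.2, 0.8])
```

In a scheme like this, `weight` would normally be tuned on a validation set, trading BERT's contextual accuracy against FastText's speed and robustness on short texts.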
