Automatic Textual Normalization for Hate Speech Detection

Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen

TL;DR

This work addresses non-standard Vietnamese text on social media with a Seq2Seq-based textual normalization pipeline and a dataset of 2,181 annotated pairs with an inter-annotator agreement of 0.9014. It evaluates three S2S variants (S2S, S2SSelf, S2SMulti) at the token level and finds that normalization accuracy stays just under 70%, yet applying normalization improves hate speech detection accuracy by roughly 2%. The dataset, built from ViHSD sources, is analyzed for annotation errors, and the analysis identifies sentence length and token ambiguity among the factors that make normalization hard. The study shows that lexical normalization can benefit downstream NLP tasks and outlines future work: scaling the dataset, exploring additional models, and applying normalization to tasks beyond HSD.

Abstract

Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization, involving the creation of manual rules or the implementation of multi-staged deep learning frameworks, which necessitate extensive efforts to craft intricate rules. In contrast, our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization, comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for textual normalization, our results reveal that the accuracy achieved falls slightly short of 70%. Nevertheless, textual normalization enhances the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve the performance of complex NLP tasks. Our dataset is accessible for research purposes.
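To make the notion of token-level accuracy concrete, the following is a minimal sketch of one way such a score could be computed. It is not the authors' evaluation code: the function name `token_accuracy`, the whitespace tokenization, and the position-by-position comparison are all assumptions, and the example sentences are hypothetical rather than taken from the dataset.

```python
def token_accuracy(predicted, reference):
    """Fraction of reference tokens reproduced at the same position.

    Assumption: sentences are whitespace-tokenized and compared position by
    position; predicted tokens beyond the reference length are ignored, and
    missing predicted tokens count as errors.
    """
    total = 0
    correct = 0
    for pred_sent, ref_sent in zip(predicted, reference):
        pred_tokens = pred_sent.split()
        ref_tokens = ref_sent.split()
        total += len(ref_tokens)
        # zip stops at the shorter sequence, so unmatched reference tokens
        # are counted in `total` but never in `correct`.
        correct += sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return correct / total if total else 0.0


# Hypothetical model outputs and gold normalizations (not from the dataset).
predicted = ["tôi không biết", "bạn đi đâu dz"]
reference = ["tôi không biết", "bạn đi đâu vậy"]

print(f"token accuracy: {token_accuracy(predicted, reference):.3f}")
```

Here six of the seven reference tokens match, so the score is 6/7. A position-wise comparison is only meaningful when normalization keeps tokens aligned one-to-one; if a model can merge or split tokens, an alignment-based metric would be needed instead.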

Paper Structure (19 sections, 3 equations, 2 figures, 7 tables)

This paper contains 19 sections, 3 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Effect of sentence length on the error rate of models.
  • Figure 2: Automatic textual normalization for Hate Speech Detection.