A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media

Dung Ha Nguyen, Anh Thi Hoang Nguyen, Kiet Van Nguyen

TL;DR

This study introduces an automatic labeling framework that integrates semi-supervised learning with weak supervision techniques to address lexical normalization of social media text in low-resource languages such as Vietnamese.

Abstract

This study introduces an automatic labeling framework to address the challenges of lexical normalization in social media texts for low-resource languages like Vietnamese. Social media data is rich and diverse, but the evolving and varied language used in these contexts makes manual labeling labor-intensive and expensive. To tackle these issues, we propose a framework that integrates semi-supervised learning with weak supervision techniques. This approach improves the quality of the training dataset and expands its size while minimizing manual labeling effort. Our framework automatically labels raw data, converting non-standard vocabulary into standardized forms, thereby improving the accuracy and consistency of the training data. Experimental results demonstrate the effectiveness of our weak supervision framework in normalizing Vietnamese text, especially when using pre-trained language models. The proposed framework achieves an F1-score of 82.72% and maintains vocabulary integrity with an accuracy of up to 99.22%. It also handles undiacritized text effectively under various conditions. The framework improves normalization quality and raises the accuracy of various downstream NLP tasks by 1-3% on average.
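
To make the task concrete, the following minimal Python sketch pairs the source and target tokens quoted in the Figure 5 caption and lists the tokens that normalization changes. It only illustrates what token-level normalization pairs look like; it is not the authors' framework, and the helper name `aligned_pairs` and the rule "a token is non-standard if its normalized form differs" are assumptions made for illustration.

```python
# Token sequences copied from the Figure 5 caption.
source = ["ca", "màk", "hay", "z", "qá"]
target = ["công an", "mà", "hay", "vậy", "quá"]


def aligned_pairs(src, tgt):
    """Pair each source token with its normalized form.

    Hypothetical helper: assumes a one-to-one token-level alignment,
    so both sequences must have the same length.
    """
    if len(src) != len(tgt):
        raise ValueError("token-level alignment needs equal-length sequences")
    return list(zip(src, tgt))


pairs = aligned_pairs(source, target)

# Assumption for this sketch: a source token is treated as a
# non-standard word (NSW) when its normalized form differs from it.
nsw_tokens = [s for s, t in pairs if s != t]

for s, t in pairs:
    print(f"{s!r} -> {t!r}")
print("Non-standard tokens:", nsw_tokens)
```

Note that a single source token can map to a multi-word target, as with 'ca' and 'công an', which is why the pairing is done per source token rather than per output word.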

Paper Structure (45 sections, 10 equations, 12 figures, 21 tables)

Figures (12)

  • Figure 1: Preprocessing and Data Generation Workflow for the Framework.
  • Figure 2: NSWs Proportion in ViLexNorm Train, Dev, and Test Dataset.
  • Figure 3: The Distribution of Sentence Length of ViLexNorm Train, Dev, and Test Dataset.
  • Figure 4: Main Components in the Architecture of the Proposed Framework.
  • Figure 5: Token-Level Alignment Tokenization Process of the source ['ca', 'màk', 'hay', 'z', 'qá'] and target sentence ['công an', 'mà', 'hay', 'vậy', 'quá'] (English: The police are always like that) using ViSoBERT tokenizer.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2