A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media

Dung Ha Nguyen; Anh Thi Hoang Nguyen; Kiet Van Nguyen

Abstract

This study introduces an automatic labeling framework for lexical normalization of social media text in low-resource languages such as Vietnamese. Social media data is rich and diverse, but its language evolves quickly and varies widely, which makes manual labeling labor-intensive and expensive. To address this, we propose a framework that integrates semi-supervised learning with weak supervision techniques. This approach improves the quality of the training dataset and expands its size while minimizing manual labeling effort. The framework automatically labels raw data, converting non-standard vocabulary into standardized forms and thereby improving the accuracy and consistency of the training data. Experimental results demonstrate the effectiveness of our weak supervision framework in normalizing Vietnamese text, especially when using pre-trained language models. The proposed framework achieves an F1-score of 82.72% and maintains vocabulary integrity with an accuracy of up to 99.22%. It also handles undiacritized text effectively under various conditions. The framework improves normalization quality and raises the accuracy of various downstream NLP tasks by 1-3% on average.
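To make the idea of weak supervision for lexical normalization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a set of hypothetical labeling functions (`lf_abbreviation`, `lf_spelling`) backed by toy dictionaries, and it aggregates their proposals by simple majority vote; the abstract does not specify how the actual framework combines its weak signals. The example sentence is the one shown in the paper's Figure 5 caption.

```python
from collections import Counter
from typing import Callable, List, Optional

# Toy lookup tables, illustrative only (not taken from the paper's resources).
ABBREVIATIONS = {"ca": "công an", "z": "vậy"}
SPELLING_VARIANTS = {"màk": "mà", "qá": "quá", "z": "vậy"}

# A labeling function proposes a normalized form, or returns None to abstain.
LabelingFunction = Callable[[str], Optional[str]]


def lf_abbreviation(token: str) -> Optional[str]:
    """Hypothetical labeling function: expand known abbreviations."""
    return ABBREVIATIONS.get(token)


def lf_spelling(token: str) -> Optional[str]:
    """Hypothetical labeling function: fix known non-standard spellings."""
    return SPELLING_VARIANTS.get(token)


def weak_label(token: str, lfs: List[LabelingFunction]) -> str:
    """Aggregate labeling-function votes by majority; keep the token if all abstain."""
    votes = [lf(token) for lf in lfs]
    votes = [v for v in votes if v is not None]
    if not votes:
        return token  # no weak signal: treat the token as already standard
    return Counter(votes).most_common(1)[0][0]


def normalize(tokens: List[str]) -> List[str]:
    """Apply weak labeling token by token to produce a normalized sentence."""
    lfs = [lf_abbreviation, lf_spelling]
    return [weak_label(t, lfs) for t in tokens]


if __name__ == "__main__":
    source = ["ca", "màk", "hay", "z", "qá"]
    print(normalize(source))
```

In a weakly supervised pipeline, sentence pairs labeled this way would serve as additional, noisier training data for a normalization model rather than as final output.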

Paper Structure (45 sections, 10 equations, 12 figures, 21 tables)


Introduction
Foundations of Lexical Normalization and Data Labeling
Lexical Normalization
Existing Lexical Normalization Methods
Data Labeling
Existing Data Labeling Approaches
A Weakly Supervised Approach to Data Labeling for Lexical Normalization in Vietnamese Social Media
Datasets and Data Pre-processing
Datasets
Data Preprocessing
Basic Preprocessing
Named Entity Recognition (NER) Pipeline
Word Segmentation
Tokenization
Dataset Description
...and 30 more sections

Figures (12)

Figure 1: Preprocessing and Data Generation Workflow for the Framework.
Figure 2: NSWs Proportion in ViLexNorm Train, Dev, and Test Dataset.
Figure 3: The Distribution of Sentence Length of ViLexNorm Train, Dev, and Test Dataset.
Figure 4: Main Components in the Architecture of the Proposed Framework.
Figure 5: Token-Level Alignment Tokenization Process of the source ['ca', 'màk', 'hay', 'z', 'qá'] and target sentence ['công an', 'mà', 'hay', 'vậy', 'quá'] (English: The police are always like that) using ViSoBERT tokenizer.
...and 7 more figures

Theorems & Definitions (2)

Definition 1
Definition 2
