ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization

Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Kiet Van Nguyen

TL;DR

ViSoLex tackles the challenge of non-standard language in Vietnamese social media by delivering an open-source, dual-service platform for non-standard word (NSW) lookup and lexical normalization. It combines a multitask normalization model with a Rule Attention Network and weak supervision, augmented by a growing NSW dictionary and GPT-4o-driven dictionary expansion, to reduce labeled-data requirements. The system demonstrates improved F1-score and accuracy over a prior weakly supervised baseline, with particular gains when diacritics are removed, and provides accessible entry points for researchers and non-technical users alike. By making the repository extensible and adaptable to other languages and datasets, ViSoLex contributes a scalable resource for Vietnamese NLP and invites broader applications in lexical normalization research.

Abstract

ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services, Non-Standard Word (NSW) Lookup and Lexical Normalization, which let users retrieve the standard forms of informal language and standardize text containing NSWs. ViSoLex's architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system's design, functionality, and applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system's capabilities to additional languages and improving the handling of more complex non-standard linguistic patterns.

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: The Architecture of ViSoLex. The diagram illustrates the modular components enabling NSW Lookup and Lexical Normalization services, including their interactions and flow of user inputs.
  • Figure 2: Weak Supervision Training. This figure illustrates the training process of the lexical normalizer, which integrates multitask learning and a Rule Attention Network guided by weak supervision rules to effectively standardize NSWs in Vietnamese social media text.
  • Figure 3: User Interface of NSW Lookup Service. This interface allows users to search for non-standard words and retrieve their standard equivalents, definitions, and examples from the dictionary.
  • Figure 4: User Interface of Lexical Normalization Service. This interface enables users to input sentences with non-standard words and receive fully normalized outputs in real time.
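
To make the two services in Figures 3 and 4 concrete, here is a minimal Python sketch of what a lookup and a normalization call might look like. It is not the ViSoLex implementation: the paper's normalizer is a multitask model with a Rule Attention Network trained under weak supervision, whereas this sketch substitutes a plain dictionary replacement. The `NSWEntry` type, the `lookup` and `normalize` functions, and the three dictionary entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NSWEntry:
    """One dictionary record: standard form, definition, usage example."""
    standard: str
    definition: str
    example: str


# Hypothetical toy dictionary; the real ViSoLex dictionary is far larger.
NSW_DICT = {
    "ko": NSWEntry("không", "no / not", "mình ko biết"),
    "dc": NSWEntry("được", "can / be able to", "làm dc rồi"),
    "j": NSWEntry("gì", "what", "bạn nói j"),
}


def lookup(word: str) -> Optional[NSWEntry]:
    """NSW Lookup sketch: return the entry for a non-standard word, if any."""
    return NSW_DICT.get(word.lower())


def normalize(sentence: str) -> str:
    """Normalization sketch: replace each known NSW token with its standard
    form. Assumption: whitespace tokenization and context-free substitution,
    which a learned normalizer would not be limited to."""
    tokens = []
    for token in sentence.split():
        entry = lookup(token)
        tokens.append(entry.standard if entry else token)
    return " ".join(tokens)
```

A dictionary-only approach like this cannot resolve NSWs whose standard form depends on context or that are missing from the dictionary, which is the gap a trained normalization model is meant to cover.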