Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages
Shohei Higashiyama, Masao Utiyama
TL;DR
This work presents a comprehensive, boundary-aware evaluation of lexical normalization (LN) for unsegmented languages. It centers on Japanese, with a large, multi-domain LN dataset (JMLN) and a suite of Transformer-based LN methods spanning encoder-only, encoder–decoder, and decoder-only architectures. The methods include a novel encoder-based infilling approach and multiple generative strategies, and they are assessed on Japanese and Thai datasets to reveal architecture- and data-size-dependent trade-offs in precision, recall, and throughput. Key findings: encoder-only models offer high throughput; decoder-only models excel in recall with competitive precision; and larger training sets (4k–8k) yield substantial gains, although domain novelty (e.g., typos) remains challenging. The study offers multi-perspective insights into model behavior, data requirements, and domain effects, and contributes JMLN as a resource and benchmark for LN in unsegmented languages, with practical implications for downstream NLP tasks.
Abstract
Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that encoder-only and decoder-only approaches both achieve promising results in accuracy and efficiency.
