Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

Shohei Higashiyama, Masao Utiyama

TL;DR

This work presents a comprehensive, boundary-aware evaluation of lexical normalization (LN) for unsegmented languages. It centers on Japanese, with a large, multi-domain LN dataset (JMLN) and a suite of Transformer-based LN methods spanning encoder-only, encoder–decoder, and decoder-only architectures. The methods include a novel encoder-based infilling approach and several generative strategies, and they are assessed on Japanese and Thai datasets to reveal architecture- and data-size-dependent trade-offs in precision, recall, and throughput. Key findings: encoder-only models offer high throughput; decoder-only models excel in recall with competitive precision; and larger training sets (4k–8k) yield substantial gains, although domain novelty (e.g., typos) remains challenging. The study offers multi-perspective insights into model behavior, data requirements, and domain effects, and contributes JMLN as a resource and benchmark for LN in unsegmented languages, with practical implications for downstream NLP tasks.

Abstract

Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.

Paper Structure

This paper contains 62 sections, 1 equation, 4 figures, 23 tables.

Figures (4)

  • Figure 1: Flow of our detect&infill approach for the input text "ついったみてる", which means "(I'm) looking at Twitter." "M" and "S" represent the MASK and SEP tokens, respectively. The original characters "ついった" follow the SEP token but are omitted from the figure.
  • Figure 2: JMLN test results for each training data size.
  • Figure 3: Instruction prompts for decoder-only models with Plain (top), Struct (middle), and Span approaches (bottom).
  • Figure 4: Plot of $F_{0.5}$ scores in Table \ref{tab:res_ja} for each model series (T5, mT5, Llama-3.2, Qwen2.5, and Sarashina2.2). The scores for the Struct and Span approaches are shown with solid and dotted lines, respectively.
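
The F0.5 score plotted in Figure 4 is not defined in this summary. Assuming it is the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall, it could be computed as in the minimal sketch below. The function name `f_beta` and the example values are illustrative and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta).

    beta < 1 favors precision; beta > 1 favors recall; beta = 1 gives F1.
    This is the textbook definition, assumed (not confirmed) to match the
    F0.5 reported in the paper.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Illustrative values only: the same pair of numbers scores higher
# when the larger one is precision rather than recall.
high_precision = f_beta(precision=0.8, recall=0.4)
high_recall = f_beta(precision=0.4, recall=0.8)
```

Under this definition, a system that trades some recall for precision is rewarded, which is consistent with the summary's emphasis on precision and recall trade-offs across architectures.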