A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines
Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal
TL;DR
The paper tackles the challenge of evaluating text normalization in NLP by introducing a task-oriented, multi-dimensional framework that jointly considers vocabulary compression, semantic preservation, and downstream impact. The framework combines macro-level metrics (Compression Ratio $CR$ and Stemming Effectiveness Score $SES$) with micro-level fidelity (Average Normalized Levenshtein Distance $ANLD$) and downstream sensitivity (Model Performance Delta $MPD$), operationalized through formulas such as $SES = IRS \times CR$ and $ANLD = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{Levenshtein}(orig_i, stemmed_i)}{|orig_i|}$. It is validated on English and Bangla with the Snowball, BNLTK, and BanLemma stemmers, revealing that a high $SES$ can coincide with harmful surface-form distortion (high $ANLD$), while safe, moderate compression can yield downstream gains. The study advocates reporting $SES$ together with $ANLD$, with $ANLD$ acting as a safety gate, and plans to extend the framework to transformer-based models to broaden its applicability and robustness.
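The two formulas quoted in the TL;DR can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, $IRS$ is taken as a given input because the summary does not define it, and the definition of $CR$ as unique original forms divided by unique stemmed forms is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anld(pairs):
    """Average of Levenshtein(orig, stemmed) / |orig| over (orig, stemmed) pairs."""
    ratios = [levenshtein(orig, stem) / len(orig) for orig, stem in pairs if orig]
    return sum(ratios) / len(ratios) if ratios else 0.0


def compression_ratio(originals, stems):
    """ASSUMPTION: CR = number of unique original forms / number of unique stems."""
    return len(set(originals)) / len(set(stems))


def ses(irs, cr):
    """SES = IRS x CR, as quoted in the TL;DR; IRS must be supplied by the caller."""
    return irs * cr


if __name__ == "__main__":
    # Toy word list, for illustration only (not data from the paper).
    words = ["running", "runs", "run", "cats"]
    stems = ["run", "run", "run", "cat"]
    print("ANLD:", anld(list(zip(words, stems))))
    print("CR:", compression_ratio(words, stems))
```

A stem that equals its original contributes 0 to the average, so $ANLD$ grows only with the fraction of each word that the stemmer alters.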
Abstract
Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming, so new evaluation approaches are needed. To address this issue, this study proposes a novel, task-oriented approach to evaluating stemming methods that considers three aspects: (1) the utility of stemming, using the Stemming Effectiveness Score (SES); (2) the impact of stemming on downstream tasks, using the Model Performance Delta (MPD); and (3) the semantic similarity between stemmed and original words, using the Average Normalized Levenshtein Distance (ANLD). Together, these provide a comprehensive evaluation framework. We apply the framework to compare two stemmers, BNLTK for Bangla and Snowball for English, and the results reveal a significant issue with relying on SES alone. The Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), but our proposed safety measure, ANLD, shows that this high SES stems from harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream performance. In contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; it is therefore the more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
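To make the "SES plus ANLD safety gate" reading concrete, the sketch below applies a gate to the two stemmers using only the figures reported in the abstract. It is illustrative only: the `StemmerReport` type, the helper names, and especially the cutoff `max_anld=0.20` are assumptions chosen for the example, not values from the paper. The `implied_irs` helper simply rearranges $SES = IRS \times CR$ from the TL;DR.

```python
from dataclasses import dataclass


@dataclass
class StemmerReport:
    name: str
    ses: float   # Stemming Effectiveness Score
    anld: float  # Average Normalized Levenshtein Distance


def passes_safety_gate(report: StemmerReport, max_anld: float = 0.20) -> bool:
    """True if surface-form distortion stays at or below the cutoff.

    ASSUMPTION: 0.20 is a hypothetical threshold picked for illustration;
    the source text does not state a cutoff.
    """
    return report.anld <= max_anld


def rank_safe(reports, max_anld: float = 0.20):
    """Drop stemmers that fail the ANLD gate, then sort the rest by SES (descending)."""
    safe = [r for r in reports if passes_safety_gate(r, max_anld)]
    return sorted(safe, key=lambda r: r.ses, reverse=True)


def implied_irs(ses: float, cr: float) -> float:
    """Rearrangement of SES = IRS x CR, i.e. IRS = SES / CR."""
    return ses / cr


# Figures as reported in the abstract.
reports = [
    StemmerReport("BNLTK (Bangla)", ses=1.67, anld=0.26),
    StemmerReport("Snowball (English)", ses=1.31, anld=0.14),
]

if __name__ == "__main__":
    for r in rank_safe(reports):
        print(f"{r.name}: SES={r.ses}, ANLD={r.anld}")
    # IRS implied for the Bangla stemmer by the reported SES and CR.
    print("Implied IRS (BNLTK):", round(implied_irs(1.67, 1.90), 3))
```

Under this hypothetical cutoff the Bangla stemmer is filtered out despite its higher SES, which mirrors the abstract's conclusion that SES should not be read in isolation.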
