Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
Marco Braga, Gian Carlo Milanese, Gabriella Pasi
TL;DR
This work tackles context-sensitive text preprocessing by evaluating multiple Large Language Models (LLMs) as dynamic preprocessors for stopword removal, lemmatization, and stemming across six European languages. Using in-context and multilingual prompts, the study compares LLM outputs to traditional baselines and assesses the downstream impact on text classification with TF-IDF features and three classifiers, reporting improvements of up to $6\%$ in $F_1$ on several English datasets and notable gains on non-English datasets. Key findings show that LLMs can replicate traditional stopword removal with up to $97\%$ accuracy, lemmatization with up to $82\%$, and stemming with up to $74\%$; stemming remains the most challenging task because of context-dependent variations. The results suggest promising applications of LLM-based preprocessing, especially in low-resource languages where annotated resources for lemmatizers and stemmers are scarce, and the authors provide an open-source pipeline and prompts for broader reuse. $F_1$ is the primary metric for downstream evaluation, and the reported gains point to practical improvements in NLP pipelines when LLMs exploit contextual information.
Abstract
Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Although these techniques are context-dependent by nature, traditional methods usually ignore contextual information. In this paper, we investigate the use of Large Language Models (LLMs) for various preprocessing tasks, motivated by their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization, and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% in the $F_1$ measure compared to those trained on texts preprocessed with traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.
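As a rough illustration of the comparison described above, the sketch below applies a fixed-list stopword remover (a context-independent baseline) and scores how often a candidate output matches it. The stopword list, the example documents, the hand-written candidate outputs standing in for LLM responses, and the exact-match scoring function are all assumptions made for illustration; the prompts, baselines, and accuracy definition actually used in the paper are in the linked repository and may differ.

```python
# Hypothetical stopword list; real baselines use much larger, language-specific lists.
STOPWORDS = {"the", "is", "on", "a"}


def remove_stopwords(tokens, stopwords):
    """Context-independent baseline: drop every token found in a fixed list."""
    return [t for t in tokens if t.lower() not in stopwords]


def exact_match_accuracy(candidates, references):
    """Fraction of documents whose candidate output equals the reference output.

    This is one simple agreement score, assumed here for illustration only.
    """
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not references:
        return 0.0
    matches = sum(c == r for c, r in zip(candidates, references))
    return matches / len(references)


# Toy tokenized documents (illustrative, not from the paper's datasets).
documents = [
    ["The", "cat", "is", "on", "a", "mat"],
    ["Dogs", "bark"],
]

# Reference outputs produced by the traditional baseline.
baseline_outputs = [remove_stopwords(doc, STOPWORDS) for doc in documents]

# Hand-written stand-ins for what a model might return for each document.
candidate_outputs = [
    ["cat", "mat"],
    ["Dogs"],
]

agreement = exact_match_accuracy(candidate_outputs, baseline_outputs)
print(f"Agreement with baseline: {agreement:.2f}")
```

A document-level exact match is deliberately strict; a token-level score would give partial credit when only some tokens differ from the baseline.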
