Exploring Robustness of Multilingual LLMs on Real-World Noisy Data

Amirhossein Aliakbarzadeh; Lucie Flek; Akbar Karimi

TL;DR

This work interrogates how real-world spelling errors affect multilingual LLMs by constructing WikiTypo from Wikipedia edit histories and evaluating nine models across six languages on NLI, NER, and IC. It finds that all models are susceptible to noisy input, with robustness strongly influenced by model size, training data, and architecture, notably favoring large mT5 variants over decoder-only models. NLI shows the largest robustness gaps while IC remains relatively stable; English tends to exhibit larger degradation, though gains from noisy training can mitigate these gaps. The study provides a publicly available noisy benchmark and actionable insights for deploying multilingual LLMs in real-world, noisy settings, highlighting that model choice and training strategies crucially shape resilience to typos.

Abstract

Large Language Models (LLMs) are trained on Web data that might contain spelling errors made by humans. But do they become robust to similar real-world noise? In this paper, we investigate the effect of real-world spelling mistakes on the performance of 9 language models, with parameters ranging from 0.2B to 13B, in 3 different NLP tasks, namely Natural Language Inference (NLI), Named Entity Recognition (NER), and Intent Classification (IC). We perform our experiments on 6 different languages and build a dictionary of real-world noise for them using the Wikipedia edit history. We show that the performance gap of the studied models between the clean and noisy test data, averaged across all the datasets and languages, ranges from 2.3 to 4.3 absolute percentage points. In addition, mT5 models, in general, show more robustness compared to BLOOM, Falcon, and BERT-like models. In particular, mT5 (13B) was the most robust on average overall, across the 3 tasks, and in 4 of the 6 languages.
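To make the clean-versus-noisy comparison concrete, the following is a minimal sketch of the idea. It does not reproduce the paper's pipeline. The dictionary contents, the `rate` parameter, and the helper names `add_typos` and `accuracy_gap` are all hypothetical, and the per-word replacement scheme is an assumption. The only part taken from the abstract is that the gap is the clean score minus the noisy score, in absolute percentage points.

```python
import random

# Hypothetical typo dictionary: correct word -> misspellings observed for it.
# The paper builds such a dictionary from Wikipedia edit histories; these
# entries are made up for illustration.
TYPO_DICT = {
    "weather": ["wether", "waether"],
    "restaurant": ["restaraunt"],
}

def add_typos(sentence, typo_dict, rate, rng):
    """Replace each word that has known misspellings with probability `rate`.

    Assumption: whitespace tokenization and case-insensitive lookup.
    """
    noisy_words = []
    for word in sentence.split():
        options = typo_dict.get(word.lower())
        if options and rng.random() < rate:
            noisy_words.append(rng.choice(options))
        else:
            noisy_words.append(word)
    return " ".join(noisy_words)

def accuracy_gap(clean_acc, noisy_acc):
    """Gap in absolute percentage points: clean score minus noisy score."""
    return clean_acc - noisy_acc

if __name__ == "__main__":
    rng = random.Random(0)
    print(add_typos("book a restaurant for tonight", TYPO_DICT, 1.0, rng))
    print(accuracy_gap(90.0, 86.5))
```

A larger gap means the model loses more accuracy when typos are present. Averaging this quantity over datasets and languages gives one robustness figure per model.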

Paper Structure (17 sections, 6 figures, 15 tables)

This paper contains 17 sections, 6 figures, 15 tables.

Introduction
Related Work
WikiTypo: A Collection of Real-World Typos from Wikipedia
Experimental Setup
Datasets and Tasks
Models
Fine-tuning
Results and Findings
Are larger models more robust?
Are different tasks equally sensitive to real-world noise?
How does the performance differ from English to other languages?
Noisy training narrows the gap
Why are models less robust to WikiTypo English noise?
Conclusion
Limitations
...and 2 more sections

Figures (6)

Figure 1: Typos that users make may lead LLMs to misclassify the input sentences. The sentences in the table are clean and noisy test samples of the intent classification data (SNIPS) that were misclassified by the studied models after typical typographical errors from our WikiTypo corpus were inserted.
Figure 2: Training and evaluation losses for the NER task on the WikiANN dataset. After the second epoch, the model overfits.
Figure 3: Average gap (in percentage points) between the accuracy of the experimented models on the clean data and the noisy data. The numbers indicate the average gap over all six languages on the SNIPS, WikiANN, and XNLI datasets.
Figure 4: Heatmap of the average performance gap across datasets, per model and language.
Figure 5: Performance gap between clean and noisy test sets of SNIPS (IC), XNLI (NLI) and WikiANN (NER) datasets for English (en), German (de), Spanish (es), French (fr), Hindi (hi), and Turkish (tr) languages.
...and 1 more figures
