Exploring Robustness of Multilingual LLMs on Real-World Noisy Data
Amirhossein Aliakbarzadeh, Lucie Flek, Akbar Karimi
TL;DR
This work examines how real-world spelling errors affect multilingual LLMs by constructing WikiTypo from Wikipedia edit histories and evaluating nine models across six languages on NLI, NER, and IC. It finds that all models are susceptible to noisy input, and that robustness is strongly influenced by model size, training data, and architecture, with large mT5 variants notably more robust than decoder-only models. NLI shows the largest robustness gaps while IC remains relatively stable; English tends to exhibit larger degradation, though training on noisy data can narrow these gaps. The study provides a publicly available noisy benchmark and actionable insights for deploying multilingual LLMs in real-world, noisy settings, highlighting that model choice and training strategy crucially shape resilience to typos.
Abstract
Large Language Models (LLMs) are trained on Web data that might contain spelling errors made by humans. But do they become robust to similar real-world noise? In this paper, we investigate the effect of real-world spelling mistakes on the performance of 9 language models, with parameter counts ranging from 0.2B to 13B, in 3 different NLP tasks, namely Natural Language Inference (NLI), Named Entity Recognition (NER), and Intent Classification (IC). We perform our experiments in 6 different languages and build a dictionary of real-world noise for them using the Wikipedia edit history. We show that the performance gap of the studied models between the clean and noisy test data, averaged across all the datasets and languages, ranges from 2.3 to 4.3 absolute percentage points. In addition, mT5 models, in general, show more robustness compared to BLOOM, Falcon, and BERT-like models. In particular, mT5 (13B) was the most robust on average overall, across the 3 tasks, and in 4 of the 6 languages.
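
To make the evaluation setup concrete, the following is a minimal sketch of how a typo dictionary could be applied to clean test text and how a clean-versus-noisy gap could be computed. It is an illustration under stated assumptions, not the procedure used in the paper: the dictionary entries, the function names `add_noise` and `robustness_gap`, and the `noise_rate` parameter are all hypothetical, and the actual noise-injection method is not specified in this section.

```python
import random

# Hypothetical typo dictionary: correct word -> misspellings seen in edits.
# These entries are made up for illustration only.
TYPO_DICT = {
    "receive": ["recieve"],
    "language": ["langauge", "languge"],
}


def add_noise(text, typo_dict, noise_rate=1.0, seed=0):
    """Replace whitespace-separated tokens with a known misspelling.

    Each token found in typo_dict is replaced with probability noise_rate.
    Lookup is case-insensitive, and a replaced token takes the lowercase
    form stored in the dictionary.
    """
    rng = random.Random(seed)
    noisy_tokens = []
    for token in text.split():
        variants = typo_dict.get(token.lower())
        if variants and rng.random() < noise_rate:
            noisy_tokens.append(rng.choice(variants))
        else:
            noisy_tokens.append(token)
    return " ".join(noisy_tokens)


def robustness_gap(clean_score, noisy_score):
    """Absolute gap in percentage points between clean and noisy scores."""
    return clean_score - noisy_score
```

Under these assumptions, `add_noise("we receive data", TYPO_DICT)` swaps in the single known misspelling of "receive", and `robustness_gap` applied to a model's clean and noisy scores gives the kind of absolute percentage-point gap reported in the abstract.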
