Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models

Zixiang Xu, Yanbo Wang, Yue Huang, Xiuying Chen, Jieyu Zhao, Meng Jiang, Xiangliang Zhang

TL;DR

This work tackles the challenge of cross-lingual weaknesses in multilingual large language models by introducing a beam-search–based probing pipeline guided by LLM-based simulation to automatically generate bilingual question pairs. The authors build a sizable 16-language dataset (over 6,000 bilingual pairs) and demonstrate that state-of-the-art models exhibit substantial accuracy declines in target languages despite near-perfect English performance. A key finding is that linguistic similarity strongly shapes cross-lingual weaknesses and transfer, with fine-tuning on a language benefiting closely related languages more. The approach offers a practical diagnostic tool for targeted cross-lingual improvement and data augmentation, providing a solid foundation for more robust multilingual LLMs.

Abstract

Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing accuracy drops of over 50% in target languages across a wide range of models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.

Paper Structure

This paper contains 22 sections, 10 equations, 33 figures, 6 tables.

Figures (33)

  • Figure 1: An example of an English-Chinese question pair discovered by our search methodology (where the Chinese question is semantically equivalent to the English) highlights the cross-lingual performance gap: even GPT-4o, despite its strong multilingual capabilities, provides the correct answer in English but gives an incorrect response in Chinese.
  • Figure 2: The overview of the proposed methodology for generating questions that precisely challenge the cross-lingual capabilities of LLMs. As depicted, the pipeline initiates with sampling English questions and creating bilingual pairs. Iterative perturbation, driven by a beam search strategy and guided by LLM-based simulation scores, refines these pairs to maximize performance divergence between English and the target language. The resulting candidate list of question pairs is designed to highlight inherent cross-lingual weaknesses in LLMs.
  • Figure 3: Evaluation of 10 models on our generated 6,600 bilingual pairs across 16 languages. While all models achieve nearly 100% accuracy in English, most experience an average accuracy drop of over 50% in the target languages. Even state-of-the-art multilingual models like GPT-4o and Claude-3.5-sonnet exhibit significant cross-lingual weaknesses.
  • Figure 4: Analysis of question conversion rates and generation costs across 16 languages based on all pairs in our candidate list. The bar chart (red) shows question conversion rates for different languages, while the line chart (purple) represents cost of generating a single question. Notably, in most languages, identifying a bilingual pair that exposes cross-lingual weaknesses costs less than $0.05. However, for languages structurally and lexically closer to English, such as French and Spanish, finding weaknesses becomes significantly harder, leading to higher costs.
  • Figure 5: Performance of LLMs on our generated English-Chinese pairs. Even smaller models like Gemma-2-9B and Llama-3.1-8B achieve perfect accuracy in English, while more than half of the models score below 50% in Chinese. Despite their strong multilingual capabilities, GPT-4o and Claude-3.5-sonnet still exhibit over a 30% accuracy drop compared to English.
  • ...and 28 more figures
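
The search loop summarized in the abstract and in the Figure 2 caption (perturb bilingual pairs, score them, keep the most promising ones) can be illustrated with a minimal beam search sketch. This is not the authors' implementation: the names `beam_search_probe`, `perturb`, and `gap_score` are hypothetical, and the toy stand-ins below replace the LLM-driven perturbation and simulation scoring, whose details are not given on this page. The sketch only assumes that a higher score means a larger expected gap between English and target-language performance.

```python
from typing import Callable, List, Tuple

# A bilingual pair: (English question, target-language question).
Pair = Tuple[str, str]
Scored = Tuple[float, Pair]


def beam_search_probe(
    seeds: List[Pair],
    perturb: Callable[[Pair], List[Pair]],
    gap_score: Callable[[Pair], float],
    beam_width: int = 2,
    steps: int = 2,
) -> List[Scored]:
    """Keep the `beam_width` pairs with the highest gap score at each step.

    `perturb` and `gap_score` are placeholders (assumptions): in a real
    pipeline they would be backed by LLM calls.
    """

    def top_k(scored: List[Scored]) -> List[Scored]:
        # Deduplicate by pair, then keep the best-scoring candidates.
        unique = {pair: score for score, pair in scored}
        ranked = sorted(unique.items(), key=lambda item: item[1], reverse=True)
        return [(score, pair) for pair, score in ranked[:beam_width]]

    beam = top_k([(gap_score(p), p) for p in seeds])
    for _ in range(steps):
        candidates = list(beam)  # current pairs stay eligible
        for _, pair in beam:
            candidates.extend((gap_score(c), c) for c in perturb(pair))
        beam = top_k(candidates)
    return beam


# Toy stand-ins so the sketch runs without any model access.
def toy_perturb(pair: Pair) -> List[Pair]:
    en, tgt = pair
    return [(en + " A", tgt + " A"), (en + " B", tgt + " B")]


def toy_gap_score(pair: Pair) -> float:
    # Pretend each "B" edit widens the simulated cross-lingual gap.
    return pair[1].count("B") / 10


result = beam_search_probe([("q", "q")], toy_perturb, toy_gap_score)
```

With the toy scorer, the beam converges on the pair that accumulated the most "B" edits. Swapping in model-backed functions leaves the loop unchanged, which is the main reason to keep perturbation and scoring as injected callables.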