Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models

Zixiang Xu; Yanbo Wang; Yue Huang; Xiuying Chen; Jieyu Zhao; Meng Jiang; Xiangliang Zhang

TL;DR

This work addresses cross-lingual weaknesses in multilingual large language models by introducing a beam-search-based probing pipeline, guided by LLM-based simulation, that automatically generates bilingual question pairs. The authors build a 16-language dataset of over 6,000 bilingual pairs and show that state-of-the-art models suffer substantial accuracy declines in target languages despite near-perfect English performance. A key finding is that linguistic similarity strongly shapes both cross-lingual weaknesses and transfer: fine-tuning on one language benefits closely related languages more than distant ones. The approach offers a practical diagnostic tool for targeted cross-lingual improvement and data augmentation, and a foundation for more robust multilingual LLMs.

Abstract

Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. Using this methodology, we construct a new dataset of over 6,000 bilingual pairs across 16 languages and demonstrate its effectiveness in revealing weaknesses even in state-of-the-art models. Extensive experiments show that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing accuracy drops of over 50% in target languages across a wide range of models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.
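To make the idea of beam search over candidate questions concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `expand` function that proposes variants of a question and a hypothetical `gap_score` function that stands in for the LLM-based simulation by estimating how much worse a model would do on the target-language version than on the English one. All names, the toy vocabulary, and the toy scoring rule are illustrative assumptions.

```python
from typing import Callable, List, Tuple


def beam_search_probe(
    seeds: List[str],
    expand: Callable[[str], List[str]],
    gap_score: Callable[[str], float],
    beam_width: int = 2,
    depth: int = 2,
) -> List[Tuple[float, str]]:
    """Generic beam search: keep the `beam_width` highest-scoring candidates per step.

    `expand` and `gap_score` are hypothetical hooks; in a real pipeline they
    would wrap LLM calls. Ties are broken by string order.
    """
    beam = sorted(((gap_score(q), q) for q in seeds), reverse=True)[:beam_width]
    for _ in range(depth):
        # Pool the current beam with every variant proposed from it.
        pool = {q for _, q in beam}
        for _, q in beam:
            pool.update(expand(q))
        beam = sorted(((gap_score(q), q) for q in pool), reverse=True)[:beam_width]
    return beam


# Toy stand-ins (assumptions for illustration only).
def toy_expand(question: str) -> List[str]:
    # Propose two variants by appending one of two made-up feature tokens.
    return [f"{question} {token}" for token in ("idiom", "date")]


def toy_gap(question: str) -> float:
    # Pretend that each "idiom" token widens the English-vs-target gap by one unit.
    return float(question.split().count("idiom"))


result = beam_search_probe(["q"], toy_expand, toy_gap, beam_width=2, depth=2)
best_score, best_question = result[0]
```

In this toy run the search drifts toward candidates that contain more of the feature the scorer penalizes, which mirrors the intent of steering generation toward questions that expose an English-versus-target discrepancy.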
