Understanding and Mitigating Language Confusion in LLMs

Kelly Marchisio; Wei-Yin Ko; Alexandre Bérard; Théo Dehaze; Sebastian Ruder

TL;DR

This work identifies a surprising form of error in large language models: language confusion, where outputs fail to match the user's intended language. It introduces the Language Confusion Benchmark (LCB), a scalable evaluation across 15 typologically diverse languages, covering monolingual and cross-lingual generation with multiple data sources and complex prompts. The study finds that even strong models exhibit substantial language confusion, and that English-centric instruction tuning and high-temperature sampling amplify the problem. It proposes three practical mitigations (adjusting decoding hyperparameters, few-shot prompting, and multilingual instruction tuning) and releases the benchmark to enable ongoing multilingual evaluation of LLMs. The findings highlight a critical barrier to equal multilingual utility and offer actionable strategies to improve cross-language performance in real-world use cases.

Abstract

We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user's desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation at https://github.com/for-ai/language-confusion.

Paper Structure (65 sections, 6 equations, 6 figures, 31 tables)

Introduction
Language Confusion Benchmark
Generation settings
Monolingual generation
Cross-lingual generation
Language Confusion Metrics
Line-level detection
Word-level detection
Binary evaluation
Data sources
Aya
Dolly
Okapi
ShareGPT
Native prompts (Ours)
...and 50 more sections

Figures (6)

Figure 1: Language Confusion can occur at the word level, line level, or over the entire output response.
Figure 2: A model is vulnerable to word-level language confusion when the number of tokens in the sampling nucleus is high and the distribution is flat. Metrics: Shannon entropy; in brackets: # of tokens in nucleus.
Figure 3: Effect of Temperature ($T$) in Nucleus Sampling. Tokens in the nucleus at $p=0.75$ are bold. Middle: Effect of $T$ on the softmax probabilities (Equation \ref{eq:softmax}). Right: Effect of $T$ on the probabilities of tokens in the nucleus right before sampling (Equation \ref{eq:normed_nucleus}). As $T$ increases, the token 狐狸 is less likely to be sampled.
Figure A1: Example of non-English word-level language confusion produced by an LLM.
Figure A2: Template used for few-shot prompting the base models. The model's answers are truncated to prevent the generation of new questions. For the instruct variants, we use similar prompting, except that the Q/A examples are formatted as User/Chatbot turns using the model's chat template.
...and 1 more figure
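
The captions of Figures 2 and 3 describe how temperature and nucleus size interact. The following is a minimal Python sketch of that interaction, not the authors' code. The token list and logits are invented for illustration, the helper names are hypothetical, and the renormalization step is assumed to correspond to the "probabilities of tokens in the nucleus right before sampling" mentioned in the Figure 3 caption.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Indices of the smallest set of top tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

def renormalize(probs, kept):
    """Rescale the kept tokens so their probabilities sum to 1."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def shannon_entropy(probs):
    """Shannon entropy in bits; a flatter distribution gives a higher value."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token candidates and logits (illustrative only).
tokens = ["狐狸", "fox", "the", "a"]
logits = [4.0, 2.5, 1.0, 0.5]

for T in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    kept = nucleus(probs, top_p=0.75)
    normed = renormalize(probs, kept)
    summary = ", ".join(f"{tokens[i]}={normed[i]:.3f}" for i in kept)
    print(f"T={T}: entropy={shannon_entropy(probs):.3f} bits, "
          f"nucleus size={len(kept)} [{summary}]")
```

With these toy logits, raising the temperature flattens the distribution, admits a second token into the nucleus at $p=0.75$, and lowers the chance of sampling the intended-language token, which matches the qualitative trend the captions describe.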
