Evaluating Programming Language Confusion

Micheline Bénédicte Moumoula, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande

TL;DR

This study systematically quantifies programming language confusion in Code LLMs across generation and translation tasks, revealing pervasive cross-language migration tendencies and a surprising priority for syntactic correctness over strict adherence to user-requested languages. Using BabelCode and CodeNet datasets across eight models, the work defines Language Confusion Pass Rate ($\mathrm{LCPR}$) and Code Parsing Pass Rate ($\mathrm{CPPR}$) to disentangle language fidelity from syntax, and shows that errors are often strategic rather than random. Key findings include a strong pull toward Python as an attractor, task-dependent differences between generation and translation, and model-specific confusion patterns that are not strictly correlated with model size. The paper also proposes mitigation strategies, such as enhanced prompts with language markers, targeted model selection, and post-generation validation, highlighting practical steps to improve multilingual code reliability in real-world software engineering workflows.
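The summary above names the two metrics without giving their formulas. As a rough illustration only, the sketch below treats $\mathrm{LCPR}$ as the fraction of outputs whose detected language matches the requested one, and $\mathrm{CPPR}$ as the fraction of outputs that parse in the language they were actually written in. These definitions, the `Sample` record, and its field names are assumptions made for this example, not the paper's formal definitions, and the language detection and parsing steps are assumed to have been run beforehand.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One model output (hypothetical record; field names are illustrative)."""
    requested: str  # language the prompt asked for
    detected: str   # language assigned to the output by some detector
    parses: bool    # whether the output parsed in the detected language


def lcpr(samples):
    """Assumed definition: share of outputs written in the requested language."""
    if not samples:
        return 0.0
    return sum(s.detected == s.requested for s in samples) / len(samples)


def cppr(samples):
    """Assumed definition: share of outputs that parse, whatever language they use."""
    if not samples:
        return 0.0
    return sum(s.parses for s in samples) / len(samples)


def migrations(samples):
    """Count (requested, detected) pairs for outputs that left the requested language."""
    return Counter(
        (s.requested, s.detected) for s in samples if s.detected != s.requested
    )
```

Under these assumptions, a model that often answers Java prompts in valid Python would score a low LCPR and a high CPPR, which is the kind of gap the summary describes as migration toward syntactically safer languages. The `migrations` counter is one simple way to see which target languages act as attractors.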

Abstract

Large Language Models for code (Code LLMs) have gained significant traction in software engineering, achieving state-of-the-art performance on various programming tasks including code completion, generation, repair, and translation. These models have demonstrated remarkable capabilities in understanding programming concepts, implementing algorithms, and even bridging different programming languages, fundamentally transforming how developers interact with coding environments. Despite these advances, Code LLMs often struggle with programming language confusion: producing code in unintended languages despite explicit instructions or obvious context. We systematically evaluate this phenomenon across diverse programming contexts. Our study assesses seven popular general-purpose and Code LLMs across multiple natural and programming languages, analyzing their behavior using four datasets (HumanEval, HumanEval-xl, MBPP, TP3) for code generation and one dataset (CodeNet) for code translation. The results reveal that language confusion occurs across all evaluated models, with StarCoder and CodeLlama exhibiting the highest confusion rates. Even high-performing models fail to maintain language consistency throughout generated solutions, particularly when handling complex algorithmic problems. We identify key factors contributing to this confusion, including syntactic similarities between programming languages and inconsistent prompt formatting. Interestingly, we find evidence suggesting that LLMs consistently exhibit strategic language migration behaviors, prioritizing languages in which they can produce more syntactically correct code even when explicitly instructed otherwise. This phenomenon is particularly pronounced in code generation tasks, where models show strong migration patterns toward Python and between syntactically similar language pairs.

Paper Structure

This paper contains 21 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Experimental workflow for programming language evaluation
  • Figure 2: Migration Patterns in Programming Language Confusion during Code Generation (with BabelCode dataset). NOTE: Data for all models are provided in the Supplementary file.
  • Figure 3: Migration Patterns in PL Confusion during Code Translation (with CodeNet dataset). NOTE: Data for all models are provided in the Supplementary file.
  • Figure 4: Directional Analysis of Programming Language Confusion in Translation Tasks.