Output Format Biases in the Evaluation of Large Language Models for Code Translation

Marcos Macedo, Yuan Tian, Filipe R. Cogo, Bram Adams

TL;DR

The paper identifies a critical bias in evaluating LLMs for code translation: output format bias, where non-code text embedded in model outputs distorts both execution-based and text-based metrics. It shows that prompts alone are insufficient to guarantee clean code outputs, and introduces a lightweight mitigation that combines prompt engineering with a regular-expression extractor to retrieve code reliably (Code Extraction Success Rate, CSR, of 92.73% and MSR of 93.40% on open models). Across 3,820 translation pairs from five programming languages and 11 open-source LLMs (plus five closed models), the study shows that controlling the output format substantially raises reported Computational Accuracy (CA) (average 31.92% under Controlled+Regex vs 4.92% with direct evaluation) and also shifts BLEU-based metrics, so evaluations that ignore format are significantly biased. The results underscore the need for format-aware benchmarking and practical extraction methods. Replication resources are released to support future work and adoption in both research and real-world code translation tasks.

Abstract

Code translation between programming languages (PLs) is a critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. Most existing studies instruct LLMs to perform code translation and evaluate their performance by either running the generated outputs through test suites or comparing them to reference outputs (ground truth). These outputs, however, may contain not only executable source code but also additional non-code elements, such as natural language explanations or formatting tokens. We refer to the combination of source code and non-code elements as the output format. It is crucial to understand and address variations in output format, as non-code elements can interfere with evaluation metrics, resulting in biased assessments of model performance and comparisons. We conduct an empirical analysis of the outputs from eleven instruct-tuned open-source LLMs, across five PLs: C, C++, Go, Java, and Python. The results show that between 26.4% and 73.7% of outputs produced by our evaluated LLMs necessitate post-processing. To mitigate output format bias, we propose a strategic combination of prompt engineering and regular expressions that effectively extracts source code from mixed-format outputs, enabling the eleven open-source models to achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our empirical study confirms that output format bias affects widely used execution-based metrics, i.e., Computational Accuracy (CA), and text-based metrics, i.e., BLEU, CodeBLEU and CrystalBLEU. Additionally, we test five closed-source LLMs and observe that they also generate varying distributions of output formats, which could lead to output format biases. Our results highlight the need to mitigate the output format bias to enable reliable evaluations in LLMs for code translation.
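The regular-expression step described in the abstract can be illustrated with a minimal sketch. This is not the authors' extractor: the pattern `FENCE_RE` and the helpers `extract_code` and `extraction_rate` are hypothetical names, and the sketch assumes the code is wrapped in a matched pair of triple back-ticks with an optional language tag. The rate computed here is only a rough proxy, since this excerpt does not give the paper's exact definition of CSR.

```python
import re
from typing import List, Optional

# Assumed pattern (not taken from the paper): an opening fence with an
# optional language tag, then everything up to the next closing fence.
FENCE_RE = re.compile(r"```[ \t]*([\w+#-]*)[ \t]*\r?\n(.*?)```", re.DOTALL)


def extract_code(output: str) -> Optional[str]:
    """Return the body of the first fenced block in `output`, or None."""
    match = FENCE_RE.search(output)
    if match is None:
        return None
    # Drop the newlines that sit next to the fences.
    return match.group(2).strip("\n")


def extraction_rate(outputs: List[str]) -> float:
    """Fraction of outputs from which a fenced block could be extracted."""
    if not outputs:
        return 0.0
    extracted = sum(1 for text in outputs if extract_code(text) is not None)
    return extracted / len(outputs)


if __name__ == "__main__":
    samples = [
        "Here is the translation:\n```python\nprint('hi')\n```\nIt prints hi.",
        "print('hi')",  # direct output with no fences: nothing to extract
    ]
    for sample in samples:
        print(repr(extract_code(sample)))
    print(extraction_rate(samples))
```

A pattern like this handles only the wrapped-code case. Direct output and unbalanced back-ticks, which the paper also observes, would need additional rules.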

Paper Structure (35 sections, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Examples of the observed output formats in our study. (a) Python code in Direct Output with Additional text (the explanation). (b) C code wrapped with three back-ticks followed by the language extension, with Additional text. (c) Python code that has three back-ticks at the end but no matching opening back-ticks. (d) This output format occurs in RQ2 as the model does not generate code after the back-ticks and instead generates a completely new code block. Some code examples are shortened for brevity.
  • Figure 2: Distribution of program length in Pan et al.'s dataset before the cut-off.
  • Figure 3: Distribution of the token lengths of programs in our dataset.
  • Figure 4: Examples of the Prompt Templates used in WizardCoder. The same instruction is used for all the LLMs, following the recommended prompt template provided by the authors of each model, as outlined in their respective model cards. The exception is Reference Prompt, which is used as-is.
  • Figure 5: Distribution of output formats observed for each prompt in RQ1. The height of the bar represents the observed proportion of output formats over the sampled generation outputs for the prompt.
  • ...and 5 more figures