English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports

Avinash Patil, Siru Tao, Aryan Jadon

TL;DR

The paper evaluates machine translation of multilingual bug reports from the Visual Studio Code project, comparing seven MT/LLM systems (ChatGPT, Claude, Gemini, Mistral, LLaMA, AWS Translate, and DeepL) under a consistent protocol. It uses a 1,300-item dataset of bug reports carrying the english-please label, with human English references, and scores the systems with several translation metrics (BLEU, BERTScore, COMET, METEOR, ROUGE) plus language-identification measures to reveal strengths and trade-offs among them. ChatGPT achieves the strongest translation quality across all automatic metrics, while AWS Translate has the highest language-identification accuracy; no single system dominates across tasks. The study highlights the need for domain adaptation for technical content and provides actionable guidance for integrating MT into bug-triaging workflows, with code and data available on GitHub.

Abstract

Accurate translation of bug reports is critical for efficient collaboration in global software development. In this study, we conduct the first comprehensive evaluation of machine translation (MT) performance on bug reports, analyzing the capabilities of DeepL, AWS Translate, and large language models such as ChatGPT, Claude, Gemini, LLaMA, and Mistral using data from the Visual Studio Code GitHub repository, specifically focusing on reports labeled with the english-please tag. To assess both translation quality and source language identification accuracy, we employ a range of MT evaluation metrics, including BLEU, BERTScore, COMET, METEOR, and ROUGE, alongside classification metrics such as accuracy, precision, recall, and F1-score. Our findings reveal that while ChatGPT (gpt-4o) excels in semantic and lexical translation quality, it does not lead in source language identification. Claude and Mistral achieve the highest F1-scores (0.7182 and 0.7142, respectively), and Gemini records the best precision (0.7414). AWS Translate shows the highest accuracy (0.4717) in identifying source languages. These results highlight that no single system dominates across all tasks, reinforcing the importance of task-specific evaluations. This study underscores the need for domain adaptation when translating technical content and provides actionable insights for integrating MT into bug-triaging workflows. The code and dataset for this paper are available on GitHub: https://github.com/av9ash/English-Please
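
To make the classification metrics named in the abstract concrete, the following is a minimal sketch of how accuracy, precision, recall, and F1 could be computed for source language identification. It is not taken from the paper's repository: the function name `language_id_scores`, the toy labels, and the choice of macro averaging over languages are all assumptions made for illustration, since the abstract does not say which averaging scheme the authors used.

```python
from collections import Counter


def language_id_scores(gold, predicted):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Hypothetical helper: macro averaging is an assumption, not
    something the abstract specifies.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")

    # Per-language true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # system claimed language p, but it was wrong
            fn[g] += 1  # system missed the true language g

    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with made-up ISO language codes (not data from the paper).
gold = ["zh", "zh", "ja", "ru"]
predicted = ["zh", "ja", "ja", "ru"]
scores = language_id_scores(gold, predicted)
print(scores)
```

Accuracy counts exact matches over all reports, while the other three scores are computed per language and then averaged, which is one reason a system can rank differently on accuracy than on F1.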

Paper Structure

This paper contains 11 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Bug reports over time.
  • Figure 2: Top 20 most common labels in bug reports.
  • Figure 3: Violin plots of different MT evaluation metrics across AWS, GPT, and DeepL translation tools.
  • Figure 4: Confusion matrices for language identification across seven machine translation models. Darker diagonal entries indicate correct classifications.