A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Jonathan Katzy, Yongcheng Huang, Gopal-Raj Panchu, Maksym Ziemlewski, Paris Loizides, Sander Vermeulen, Arie van Deursen, Maliheh Izadi

TL;DR

This study addresses the challenge of non-English code comment generation by large language models, evaluating five state-of-the-art code LLMs across five natural languages. It combines qualitative open coding, which yields a taxonomy of 26 error categories and a dataset of 12,500 labeled generations, with an assessment of how well common metrics track comment correctness across languages. The findings show that semantic errors dominate non-English outputs and that neural evaluation metrics often fail to separate random noise from genuine predictions, raising concerns about automated judgment in multilingual settings. The work advocates diversifying training data for non-English code and retaining human evaluation in assessment pipelines, and it releases the labeled dataset to benchmark and guide future improvements in multilingual code commenting.

Abstract

Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identify a taxonomy of 26 distinct error categories in model-generated code comments. These categories highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.
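The score overlap mentioned above can be made concrete with a simple separation statistic. The sketch below is an illustration only and is not the analysis used in the paper: the function name `separation_auc` and the toy scores are assumptions. It estimates the probability that a randomly chosen expert-rated correct comment receives a higher metric score than a randomly chosen incorrect one. A value near 0.5 would mean the metric barely separates the two groups, while 1.0 would mean perfect separation.

```python
from itertools import product


def separation_auc(correct_scores, incorrect_scores):
    """Probability that a correct comment outscores an incorrect one.

    Every (correct, incorrect) pair is compared; ties count as half a win.
    This is a hypothetical helper, not part of the paper's tooling.
    """
    if not correct_scores or not incorrect_scores:
        raise ValueError("both score groups must be non-empty")
    wins = 0.0
    for c, i in product(correct_scores, incorrect_scores):
        if c > i:
            wins += 1.0
        elif c == i:
            wins += 0.5
    return wins / (len(correct_scores) * len(incorrect_scores))


# Made-up toy scores for illustration; NOT taken from the paper's dataset.
toy_correct = [0.82, 0.75, 0.64, 0.58]
toy_incorrect = [0.79, 0.66, 0.61, 0.40]

auc = separation_auc(toy_correct, toy_incorrect)
print(f"separation AUC: {auc:.3f}")
```

With real data, the two lists would hold one metric's scores for the generations labeled correct and incorrect by the human annotators, computed separately per metric and per natural language.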

Paper Structure

This paper contains 39 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Example input used for inference
  • Figure 2: Example Evaluation of a Prediction
  • Figure 3: Metric scores comparing LLM-generated comments to random samples of tokens drawn from two separate distributions.
  • Figure 4: Strip plot showing the scores assigned to comment generations by different metrics.