Large Language Models for Code Summarization
Balázs Szalontai, Gergő Szalay, Tamás Márton, Anna Sike, Balázs Pintér, Tibor Gregorics
TL;DR
This paper surveys open-source LLMs for code, examining their abilities in code generation (text-to-code) and code summarization/explanation (code-to-text) using established benchmarks such as HumanEval, APPS, MBPP, DS-1000, and CodeXGLUE. It explains the evaluation metrics (notably Pass@k, BLEU, and ROUGE) and presents a synthesis of results across multiple models—CodeLlama, WizardCoder, OctoCoder/OctoGeeX, MagiCoder, WaveCoder, DeepSeekCoder, and Llama3. The findings show strong code-generation performance among open models, with Llama3 achieving leading results on HumanEval, while code-explanation benchmarks like HumanEvalExplain reveal more modest, model-dependent capabilities and room for improvement. The work highlights the importance of including explanation-focused benchmarks and notes that several open models do not consistently report code-explanation results, limiting cross-model comparisons. Overall, the paper provides a structured view of current capabilities and gaps in open-source coding LLMs and suggests directions for improving code-understanding capabilities and practical software engineering tooling.
Abstract
Recently, there has been increasing activity in using deep learning for software engineering, including tasks such as code generation and summarization. In particular, the most recent Large Language Models for code appear to perform well on these tasks. In this technical report, we review how these models perform in code explanation/summarization, and we also investigate their ability to generate code from natural language descriptions.
