Large Language Models for Code Summarization

Balázs Szalontai, Gergő Szalay, Tamás Márton, Anna Sike, Balázs Pintér, Tibor Gregorics

TL;DR

This paper surveys open-source LLMs for code, examining their abilities in code generation (text-to-code) and code summarization/explanation (code-to-text) on established benchmarks such as HumanEval, APPS, MBPP, DS-1000, and CodeXGLUE. It explains the evaluation metrics (notably Pass@k, BLEU, and ROUGE) and synthesizes results across multiple models: CodeLlama, WizardCoder, OctoCoder/OctoGeeX, MagiCoder, WaveCoder, DeepSeekCoder, and Llama3. The findings show strong code-generation performance among open models, with Llama3 achieving leading results on HumanEval, while code-explanation benchmarks like HumanEvalExplain reveal more modest, model-dependent capabilities and room for improvement. The work highlights the importance of explanation-focused benchmarks and notes that several open models do not consistently report code-explanation results, which limits cross-model comparisons. Overall, the paper provides a structured view of current capabilities and gaps in open-source coding LLMs and suggests directions for improving code understanding and practical software engineering tooling.
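As a rough illustration of the Pass@k metric named above, the sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the unit tests. The report's own equation is not reproduced in this summary, so this formulation is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed unbiased-estimator form).

    n: total samples generated for the problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs, one per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages a problem where 3 of 10 samples pass with one where none do. With k = 1 the estimator reduces to the plain fraction of passing samples, c / n.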

Abstract

Recently, there has been increasing activity in using deep learning for software engineering, including tasks like code generation and summarization. In particular, the most recent coding Large Language Models seem to perform well on these problems. In this technical report, we aim to review how these models perform in code explanation/summarization, while also investigating their code generation capabilities (based on natural language descriptions).

Paper Structure

This paper contains 23 sections, 1 equation, 1 figure, and 9 tables.

Figures (1)

  • Figure 1: The LLMs we review in this report. If a model was obtained by fine-tuning, it is connected to its base model. Families of models are highlighted using the same color, while StarCoder and CodeGeeX2 are shown in gray, indicating that they are not discussed in this report.