A Comparative Analysis of Large Language Models for Code Documentation Generation

Shubhang Shekhar Dvivedi; Vyshnav Vijay; Sai Leela Rahul Pujari; Shoumik Lodh; Dhruv Kumar

TL;DR

This study addresses the problem of automatically generating high-quality code documentation with large language models (LLMs) at multiple levels of documentation (inline, function-level, and file-level). It compares five LLMs (GPT-3.5, GPT-4, Bard, Llama 2, StarChat) on 14 Python snippets, yielding 70 model-generated documentation samples plus 14 human-written references, which are evaluated with a checklist-based framework across six metrics. The findings show that all LLMs except StarChat outperform the human-written documentation, and that closed-source models generally outperform open-source ones; GPT-4, while accurate, has the longest generation time. The results can guide practitioners in selecting LLMs and documentation levels, and they point to opportunities to reduce generation time and improve file-level documentation.

Abstract

This paper presents a comprehensive comparative analysis of Large Language Models (LLMs) for code documentation generation. Code documentation is an essential part of the software development process. The paper evaluates GPT-3.5, GPT-4, Bard, Llama 2, and StarChat on six parameters (Accuracy, Completeness, Relevance, Understandability, Readability, and Time Taken) for different levels of code documentation. Our evaluation employs a checklist-based system to minimize subjectivity, providing a more objective assessment. We find that, barring StarChat, all LLMs consistently outperform the original documentation. Notably, the closed-source models GPT-3.5, GPT-4, and Bard exhibit superior performance across various parameters compared to the open-source/source-available LLMs, namely Llama 2 and StarChat. In terms of generation time, GPT-4 took the longest, followed by Llama 2 and Bard, while ChatGPT and StarChat had comparable generation times. Additionally, file-level documentation performed considerably worse across all parameters (except for time taken) than inline and function-level documentation.
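To make the idea of checklist-based scoring concrete, the following is a minimal sketch of how yes/no checklist answers could be turned into a per-model score. It is an illustration under stated assumptions, not the scoring procedure used in the paper: the function names, the checklist items, and the scoring rule (fraction of items satisfied, averaged per model) are all hypothetical.

```python
from statistics import mean

# Hypothetical yes/no checklist answers for one generated documentation sample.
# The item names are illustrative only; they are not taken from the paper.
EXAMPLE_ANSWERS = {
    "describes_parameters": True,
    "describes_return_value": True,
    "mentions_raised_exceptions": False,
    "states_purpose_of_function": True,
}


def checklist_score(answers: dict[str, bool]) -> float:
    """Return the fraction of checklist items answered 'yes' (assumed rule)."""
    if not answers:
        raise ValueError("checklist must contain at least one item")
    return sum(answers.values()) / len(answers)


def mean_score_by_model(
    ratings: list[tuple[str, str, dict[str, bool]]],
) -> dict[str, float]:
    """Average the checklist score over all snippets for each model.

    Each rating is a (model_name, snippet_id, answers) triple.
    """
    scores: dict[str, list[float]] = {}
    for model, _snippet_id, answers in ratings:
        scores.setdefault(model, []).append(checklist_score(answers))
    return {model: mean(values) for model, values in scores.items()}


if __name__ == "__main__":
    # Made-up ratings, used only to show the call pattern.
    ratings = [
        ("model_a", "snippet_01", EXAMPLE_ANSWERS),
        ("model_a", "snippet_02", {"describes_parameters": True}),
        ("model_b", "snippet_01", {"describes_parameters": False}),
    ]
    print(mean_score_by_model(ratings))
```

Because each item is a binary judgment, two raters are more likely to agree on an item than on a free-form 1-to-10 rating, which is the motivation for a checklist given in the abstract. The ratings in the usage example are invented and carry no experimental meaning.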
