Testing the Effect of Code Documentation on Large Language Model Code Understanding

William Macke; Michael Doyle

TL;DR

This paper examines how the quality of code documentation affects large language model (LLM) code understanding. The authors conduct an empirical study using unit-test generation on variations of the HumanEval dataset, altering documentation quality and content, and evaluating the outputs by runtime errors, test failures, success rates, and line coverage for GPT-3.5 and GPT-4. They find that incorrect documentation substantially degrades LLM understanding, while incomplete or missing documentation has little effect on the ability to generate correct unit tests, though line coverage can increase when comments are present. The results highlight the nuanced role of documentation in LLM code tasks and suggest that model training data and documentation quality influence how effectively models use human-provided documentation. The work provides a foundation for more robust evaluation of documentation's impact and points toward future work covering a broader range of languages, prompting strategies, and evaluation tasks.

Abstract

Large Language Models (LLMs) have demonstrated impressive abilities in recent years with regard to code generation and understanding. However, little work has investigated how documentation and other code properties affect an LLM's ability to understand and generate code or documentation. We present an empirical analysis of how underlying properties of code or documentation can affect an LLM's capabilities. We show that providing an LLM with "incorrect" documentation can greatly hinder code understanding, while incomplete or missing documentation does not seem to significantly affect an LLM's ability to understand code.

Paper Structure (7 sections, 5 figures, 2 tables)

This paper contains 7 sections, 5 figures, 2 tables.

Introduction
Related Works
LLM Code Understanding
Experimental Setup
Results
Conclusions and Discussion
Limitations

Figures (5)

Figure 1: Example HumanEval reference implementation with docstring.
Figure 2: Proportion of runtime errors or failed tests that happen with GPT-3.5 (left) and GPT-4 (right) generating unit tests on modified versions of HumanEval code.
Figure 3: Proportion of runtime errors or failed tests that happen with GPT-3.5 (left) and GPT-4 (right) generating unit tests on different proportions of docstring lines kept on HumanEval code.
Figure 4: Average percent of line coverage with GPT-3.5 (left) and GPT-4 (right) generating unit tests on modified versions of HumanEval code.
Figure 5: Average percent of line coverage with GPT-3.5 (left) and GPT-4 (right) generating unit tests on different proportions of docstring lines kept on HumanEval code.
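
The outcome categories reported in Figures 2 and 3 (runtime errors, failed tests, and successes) can be illustrated with a small sketch. This is not the authors' harness, which is not reproduced on this page. It assumes that each generated test is a plain `assert` statement, that an `AssertionError` counts as a failed test, and that any other exception counts as a runtime error. The names `classify_outcome` and `outcome_proportions` are hypothetical, and line coverage is left out.

```python
from collections import Counter

# Outcome labels used in this sketch (an assumption, not the paper's exact terms).
LABELS = ("success", "failed_test", "runtime_error")


def classify_outcome(function_source: str, test_source: str) -> str:
    """Run one generated test against a reference implementation.

    Assumption: AssertionError means the test failed; any other exception
    (including SyntaxError in the test itself) is a runtime error.
    Note: exec runs arbitrary code, so a real harness would sandbox this step.
    """
    namespace = {}
    try:
        exec(function_source, namespace)  # define the function under test
        exec(test_source, namespace)      # run the generated test
    except AssertionError:
        return "failed_test"
    except Exception:
        return "runtime_error"
    return "success"


def outcome_proportions(outcomes):
    """Return the share of each outcome label in a list of outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Hypothetical reference implementation and three generated tests.
reference = "def add(a, b):\n    return a + b\n"
generated_tests = [
    "assert add(1, 2) == 3",  # passes
    "assert add(1, 2) == 4",  # wrong expected value: failed test
    "assert add(1) == 1",     # wrong arity raises TypeError: runtime error
]

outcomes = [classify_outcome(reference, t) for t in generated_tests]
proportions = outcome_proportions(outcomes)
```

Separating `AssertionError` from other exceptions is what lets a harness like this tell a test that encodes a wrong expectation apart from one that cannot run at all, which is the distinction the figures draw between failed tests and runtime errors.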
