Testing the Effect of Code Documentation on Large Language Model Code Understanding
William Macke, Michael Doyle
TL;DR
This paper examines how the quality of code documentation affects large language model (LLM) code understanding. The authors conduct an empirical study in which GPT-3.5 and GPT-4 generate unit tests for variations of the HumanEval dataset whose documentation quality and content have been altered, and they evaluate the generated tests by runtime errors, test failures, success rates, and line coverage. They find that incorrect documentation substantially degrades LLM understanding, while incomplete or missing documentation has little effect on the ability to generate correct unit tests, though code coverage can increase when comments are present. The results highlight the nuanced role of documentation in LLM code tasks and suggest that model training data and documentation quality influence how effectively models use human-provided documentation. The work provides a foundation for more robust evaluation of documentation’s impact and points toward future work on a broader range of languages, prompting strategies, and evaluation tasks.
Abstract
Large Language Models (LLMs) have demonstrated impressive abilities in code generation and understanding in recent years. However, little work has investigated how documentation and other code properties affect an LLM's ability to understand and generate code or documentation. We present an empirical analysis of how underlying properties of code or documentation can affect an LLM's capabilities. We show that providing an LLM with "incorrect" documentation can greatly hinder code understanding, while incomplete or missing documentation does not seem to significantly affect an LLM's ability to understand code.
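
To make the outcome categories in the TL;DR concrete (runtime errors, test failures, and successes), the following is a minimal sketch of how a generated unit test might be run and labeled. It is not the authors' harness. The names `classify_test_run` and `success_rate` are hypothetical, and the rule that an `AssertionError` counts as a test failure while any other exception counts as a runtime error is an assumption made for illustration. Line coverage is omitted because it would require a separate coverage tool.

```python
from typing import List


def classify_test_run(function_source: str, test_source: str) -> str:
    """Run a generated test against a function and label the outcome.

    Assumed labeling rule (not taken from the paper):
      - AssertionError      -> "test_failure"
      - any other exception -> "runtime_error"
      - no exception        -> "success"
    """
    namespace: dict = {}
    try:
        # Define the function under test, then run the generated assertions.
        # NOTE: exec runs arbitrary code; a real harness should sandbox this.
        exec(function_source, namespace)
        exec(test_source, namespace)
    except AssertionError:
        return "test_failure"
    except Exception:
        return "runtime_error"
    return "success"


def success_rate(outcomes: List[str]) -> float:
    """Fraction of runs labeled "success"; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return outcomes.count("success") / len(outcomes)


# A toy function standing in for a HumanEval problem (hypothetical example).
ADD_SOURCE = "def add(a, b):\n    return a + b\n"

outcomes = [
    classify_test_run(ADD_SOURCE, "assert add(1, 2) == 3"),  # correct test
    classify_test_run(ADD_SOURCE, "assert add(1, 2) == 4"),  # wrong expectation
    classify_test_run(ADD_SOURCE, "assert add(1) == 1"),     # wrong call signature
]
```

Under these assumptions, a test that encodes a wrong expectation (as might follow from incorrect documentation) is counted separately from a test that cannot run at all, which is the distinction the reported metrics rely on.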
