Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests
Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, Andy Zaidman
TL;DR
UTGen addresses the difficulty of understanding automatically generated unit tests by integrating large language models (LLMs) into a search-based software testing (SBST) workflow. The approach refines test data, enriches test code with descriptive comments and meaningful names, and validates compilability, achieving coverage comparable to EvoSuite while improving developer productivity in bug fixing. In a controlled study with 32 participants, those working with UTGen tests fixed up to 33% more bugs and used up to 20% less time than those working with baseline tests, and they highlighted the improved test names, test data, and variable names. The findings suggest practical value in combining SBST with LLMs and point to future enhancements via retrieval-augmented generation and task-specific LLMs to further boost understandability and efficiency.
Abstract
Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants from both academia and industry, we investigate how the understandability of unit tests affects a software engineer's ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases. In the post-test questionnaire, participants reported that enhanced test names, test data, and variable names improved their bug-fixing process.
