ASTER: Natural and Multi-language Unit Test Generation with LLMs

Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, Saurabh Sinha

TL;DR

ASTER is a lightweight LLM pipeline, guided by static analysis, for automatic unit-test generation. It supports Java and Python and handles environment mocking. Its coverage is competitive with state-of-the-art tools, and its tests are more natural and maintainable, a finding supported by a survey of 161 professional developers. The work shows that smaller, cost-effective LLMs can rival larger models in enterprise contexts, and it provides a practical, pluggable framework for multilingual automated test generation (ATG). The findings suggest practical impact in accelerating regression testing while improving test readability and developer acceptance, with clear avenues for extending to more languages and integrating more tightly with on-prem systems.

Abstract

Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer from poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how large language models (LLMs) can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness, evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.
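
The pipeline described in the abstract (static-analysis context guiding an LLM, followed by repair of failing tests and coverage augmentation) might be organized roughly as in the following minimal Python sketch. It is not the authors' implementation: `CheckResult`, `generate_tests`, the prompt prefixes, and the retry limits are all hypothetical names and values chosen for illustration, and the LLM and the compile/coverage checker are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    """Hypothetical outcome of compiling/running a candidate test."""
    ok: bool                               # test compiled and ran
    errors: str = ""                       # compiler or runtime feedback
    uncovered: Optional[List[str]] = None  # code elements still uncovered


def generate_tests(
    focal_context: str,
    llm: Callable[[str], str],
    check: Callable[[str], CheckResult],
    max_repairs: int = 2,
    max_augments: int = 2,
) -> str:
    """Generate, repair, and augment a test (illustrative sketch only)."""
    # Stage 1: generation prompt built from statically extracted context.
    test = llm(f"GENERATE\n{focal_context}")
    result = check(test)

    # Stage 2: feed errors back to the model until the test passes
    # or the repair budget is exhausted.
    repairs = 0
    while not result.ok and repairs < max_repairs:
        test = llm(f"REPAIR\n{test}\nERRORS:\n{result.errors}")
        result = check(test)
        repairs += 1

    # Stage 3: ask for additional tests targeting uncovered code,
    # keeping a candidate only if it still passes.
    augments = 0
    while result.ok and result.uncovered and augments < max_augments:
        prompt = f"AUGMENT\n{test}\nUNCOVERED:\n" + "\n".join(result.uncovered)
        candidate = llm(prompt)
        candidate_result = check(candidate)
        if candidate_result.ok:
            test, result = candidate, candidate_result
        augments += 1

    return test
```

Passing the model and the checker as callables keeps the loop independent of any particular language or LLM, which is one way to read the "pluggable" framing in the TL;DR; the real system's interfaces may differ.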

Paper Structure (13 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Results of the survey question on whether developers would add automatically generated tests to regression test suites.
  • Figure 2: Illustration of naturalness (in terms of test names, variable names, and assertions) and mocking in test cases generated by the LLM-assisted technique of ASTER (right) compared with tests generated by EvoSuite (Fraser and Arcuri, 2011) and CodaMosa (Lemieux et al., 2023) (left).
  • Figure 3: Overview of ASTER. ①, ②, ③ represent test-generation, test-repair, and coverage-augmentation prompts.
  • Figure 4: Templates for composing prompts for test generation, test repair, and coverage augmentation.
  • Figure 5: Line, branch, and method coverage achieved on Java SE and Java EE applications by ASTER (configured with different LLMs) and EvoSuite (GPT-4 run excluded for App X for confidentiality reasons).
  • ...and 1 more figure