Automated Unit Test Improvement using Large Language Models at Meta

Nadia Alshahwan; Jubin Chheda; Anastasia Finegenova; Beliz Gokkaya; Mark Harman; Inna Harper; Alexandru Marginean; Shubho Sengupta; Eddy Wang

TL;DR

The paper presents TestGen-LLM, an Assured Offline LLMSE system that extends existing Kotlin unit tests with automatically generated, verifiably improved test cases. It does so through a filtration pipeline (each candidate must build, pass reliably, and increase coverage) and an ensemble of LLMs, prompts, and hyperparameters. The tool produces fully formed test-class improvements rather than snippets, and because it only extends existing test classes, the original tests are retained and cannot regress. In an evaluation on Instagram's Reels and Stories products, 75% of generated test cases built, 57% passed reliably, and 25% increased coverage. Across the Instagram and Facebook test-a-thons, ≈73% of its recommendations were accepted by engineers and landed in production. The work argues that industrial-scale Assured LLMSE is practical and safe, detailing deployment experiences, quantitative outcomes, qualitative observations, and open research directions for automated test improvement. Overall, the approach combines automated generation with verifiable guarantees to deliver trusted code improvements in large-scale software systems.

Abstract

This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers. We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.
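The filter sequence described above (a candidate must build, pass reliably, and increase coverage) can be sketched as follows. This is a minimal illustration under stated assumptions, not Meta's implementation: the `Candidate` record, the filter function names, and the representation of coverage as a set of line numbers are all hypothetical, and "passes reliably" is modelled here simply as passing every one of several repeated runs.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one LLM-proposed extension of a test class."""
    name: str
    builds: bool                    # did the extended test class compile?
    pass_runs: Tuple[bool, ...]     # outcome of each repeated execution
    covered_lines: FrozenSet[int]   # lines covered by the extended class


def builds(c: Candidate, baseline: FrozenSet[int]) -> bool:
    return c.builds


def passes_reliably(c: Candidate, baseline: FrozenSet[int]) -> bool:
    # Assumption: "reliable" means every repeated run passed (no flakiness).
    return len(c.pass_runs) > 0 and all(c.pass_runs)


def improves_coverage(c: Candidate, baseline: FrozenSet[int]) -> bool:
    # The candidate must cover at least one line the original tests did not.
    return bool(c.covered_lines - baseline)


# Filters run in order; a candidate is discarded at the first one it fails.
FILTERS = [
    ("build", builds),
    ("pass", passes_reliably),
    ("coverage", improves_coverage),
]


def filter_candidates(
    candidates: List[Candidate], baseline: FrozenSet[int]
) -> Tuple[List[Candidate], Dict[str, str]]:
    """Return surviving candidates and, for each reject, the failed stage."""
    survivors: List[Candidate] = []
    rejected: Dict[str, str] = {}
    for c in candidates:
        for stage, check in FILTERS:
            if not check(c, baseline):
                rejected[c.name] = stage
                break
        else:
            survivors.append(c)
    return survivors, rejected
```

Only candidates that clear all three stages would be recommended to an engineer, which is why a hallucinated test (one that does not compile, fails, or adds nothing) never reaches review in this sketch.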

Paper Structure (20 sections, 2 figures, 6 tables)

Introduction
The TestGen-LLM System
Advantages of code improvement style LLM code generation
Measurable improvement
Verifiable guarantees of non-regression
Ensemble approach
Helps humans; does not replace them
Caters well to LLMs with limited token window size
TestGen-LLM Deployment
Initial Trial
Instagram Test-a-thon November 2023
Outcomes from the first Instagram Test-a-thon
Evaluation of LLMs and Prompts
Deployment in TestGen-LLM-only Instagram Test-a-thon December 2023
Deployment at the Facebook App Test-a-thon December 2023
...and 5 more sections

Figures (2)

Figure 1: TestGen-LLM top level architecture (an instance of Assured Offline LLMSE [mhetal:intense24-keynote]).
Figure 2: Sankey diagram showing the filtration process outcomes (as percentages of all test cases) from the Experimental Study on Instagram components for Reels and Stories products, using the four prompt strategies from the prompts table (label tab:prompts) and the two language models, LLM1 and LLM2.
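
As a rough reading aid for the filtration outcomes, the cumulative percentages reported in the abstract (75% built, 57% passed reliably, 25% increased coverage) can be turned into per-stage discard rates. The sketch below assumes the stages are nested, so each percentage is a subset of the one before it and all are fractions of the full set of generated test cases; the function name `funnel_drop` is hypothetical, and the actual Sankey diagram may show a finer breakdown.

```python
from typing import Dict, List, Tuple


def funnel_drop(stage_survival: List[Tuple[str, int]]) -> Dict[str, int]:
    """Convert cumulative survival percentages into per-stage discard percentages."""
    drops: Dict[str, int] = {}
    previous = 100  # every generated test case enters the first filter
    for stage, surviving in stage_survival:
        drops[stage] = previous - surviving
        previous = surviving
    return drops


# Cumulative survival figures as reported in the abstract.
reported = [("build", 75), ("pass", 57), ("coverage", 25)]
drops = funnel_drop(reported)
# drops == {"build": 25, "pass": 18, "coverage": 32}
```

Under that assumption, 25% of test cases fail to build, a further 18% build but do not pass reliably, and a further 32% pass but add no coverage, leaving the 25% that clear every filter.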
