Retrieval-Augmented Test Generation: How Far Are We?

Jiho Shin; Nima Shiri Harzevili; Reem Aleithan; Hadi Hemmati; Song Wang

TL;DR

The paper investigates Retrieval-Augmented Generation (RAG) for automated unit test generation in five Python ML/DL libraries using four state-of-the-art LLMs. It compares three knowledge sources (API documentation, GitHub issues, and StackOverflow Q&As) under Basic RAG and API-level RAG, assessing syntactic correctness, execution correctness, line coverage, and bug detection. RAG does not improve correctness, but it raises line coverage by about 6.5% on average; GitHub issues deliver the strongest coverage gains because they supply edge cases. The generated tests also exposed real bugs: 28 were detected, 24 unique bugs were reported to developers, and 10 have been confirmed. The study points to future work on retrieval techniques that select documents covering unique program states, with the aim of improving coverage and fault detection in ML/DL library tests.

Abstract

Retrieval Augmented Generation (RAG) has advanced software engineering tasks but remains underexplored in unit test generation. To bridge this gap, we investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs and analyze the impact of different knowledge sources on their effectiveness. We examine three domain-specific sources for RAG: (1) API documentation (official guidelines), (2) GitHub issues (developer-reported resolutions), and (3) StackOverflow Q&As (community-driven solutions). Our study focuses on five widely used Python-based ML/DL libraries, TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, targeting the most-used APIs. We evaluate four state-of-the-art LLMs -- GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B -- across three strategies: basic instruction prompting, Basic RAG, and API-level RAG. Quantitatively, we assess syntactical and dynamic correctness and line coverage. While RAG does not enhance correctness, RAG improves line coverage by 6.5% on average. We found that GitHub issues result in the best improvement in line coverage by providing edge cases from various issues. We also found that these generated unit tests can help detect new bugs. Specifically, 28 bugs were detected, 24 unique bugs were reported to developers, ten were confirmed, four were rejected, and ten are awaiting developers' confirmation. Our findings highlight RAG's potential in unit test generation for improving test coverage with well-targeted knowledge sources. Future work should focus on retrieval techniques that identify documents with unique program states to optimize RAG-based unit test generation further.
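To make the distinction between the RAG strategies concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that Basic RAG retrieves from the whole document pool while API-level RAG first restricts the pool to documents associated with the target API; the abstract does not define either strategy at this level of detail. The library name `mylib`, the document texts, the bag-of-words cosine scorer, and the helper names `retrieve` and `build_prompt` are all hypothetical placeholders chosen for illustration.

```python
import math
import re
from collections import Counter

# Toy knowledge base. Every entry is an invented placeholder, not real data.
DOCS = [
    {"api": "mylib.add", "source": "github_issue",
     "text": "mylib.add gives unexpected dtype for mixed int and float inputs"},
    {"api": "mylib.add", "source": "api_doc",
     "text": "mylib.add adds two arrays elementwise and returns a new array"},
    {"api": "mylib.concat", "source": "stackoverflow",
     "text": "mylib.concat raises an error when arrays have mismatched rank"},
]

def tokenize(text):
    """Lowercase and split into simple word-like tokens."""
    return re.findall(r"[a-z0-9_.]+", text.lower())

def cosine(a_tokens, b_tokens):
    """Bag-of-words cosine similarity (stand-in for a real embedding model)."""
    ca, cb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2, api=None):
    """Return the top-k documents for the query.

    api=None  -> search the whole pool (assumed Basic RAG behaviour).
    api="..." -> search only documents tagged with that API
                 (assumed API-level RAG behaviour).
    """
    pool = [d for d in docs if api is None or d["api"] == api]
    q = tokenize(query)
    ranked = sorted(pool, key=lambda d: cosine(q, tokenize(d["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(api, retrieved):
    """Prepend the retrieved documents to a test-generation instruction."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in retrieved)
    return f"Context:\n{context}\n\nWrite a unit test suite for `{api}`."

if __name__ == "__main__":
    query = "unit test for mylib.add"
    print(build_prompt("mylib.add", retrieve(query, DOCS, k=2)))
    print(build_prompt("mylib.add", retrieve(query, DOCS, k=2, api="mylib.add")))
```

In this sketch the only difference between the two strategies is the pre-filter on the `api` field, so the API-level variant can never return a document about a different API. A real system would replace the cosine scorer with an embedding-based retriever and send the resulting prompt to one of the evaluated LLMs.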
