Retrieval-Augmented Test Generation: How Far Are We?

Jiho Shin, Nima Shiri Harzevili, Reem Aleithan, Hadi Hemmati, Song Wang

TL;DR

The paper investigates Retrieval-Augmented Generation (RAG) for automated unit test generation targeting five Python ML/DL libraries, using four state-of-the-art LLMs. It compares three knowledge sources (API documentation, GitHub issues, and StackOverflow Q&As) under Basic RAG and API-level RAG, assessing syntactic correctness, execution correctness, line coverage, and bug detection. RAG does not improve correctness, but it raises line coverage by 6.5% on average, with API-level RAG leveraging GitHub issues delivering the strongest gains. The generated tests also exposed real bugs: 28 were detected, 24 unique bugs were reported to developers, 10 were confirmed, 4 were rejected, and 10 await confirmation. The study provides practical guidance on building API-focused RAG pipelines and points to future work on retrieval techniques that identify documents with unique program states, to further improve coverage and fault detection in ML/DL library tests.

Abstract

Retrieval Augmented Generation (RAG) has advanced software engineering tasks but remains underexplored in unit test generation. To bridge this gap, we investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs and analyze the impact of different knowledge sources on their effectiveness. We examine three domain-specific sources for RAG: (1) API documentation (official guidelines), (2) GitHub issues (developer-reported resolutions), and (3) StackOverflow Q&As (community-driven solutions). Our study focuses on five widely used Python-based ML/DL libraries, TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, targeting the most-used APIs. We evaluate four state-of-the-art LLMs -- GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B -- across three strategies: basic instruction prompting, Basic RAG, and API-level RAG. Quantitatively, we assess syntactical and dynamic correctness and line coverage. While RAG does not enhance correctness, RAG improves line coverage by 6.5% on average. We found that GitHub issues result in the best improvement in line coverage by providing edge cases from various issues. We also found that these generated unit tests can help detect new bugs. Specifically, 28 bugs were detected, 24 unique bugs were reported to developers, ten were confirmed, four were rejected, and ten are awaiting developers' confirmation. Our findings highlight RAG's potential in unit test generation for improving test coverage with well-targeted knowledge sources. Future work should focus on retrieval techniques that identify documents with unique program states to optimize RAG-based unit test generation further.
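
The three prompting strategies named in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the abstract does not define how Basic RAG and API-level RAG differ, so the sketch assumes that Basic RAG searches the whole corpus while API-level RAG first restricts the corpus to documents tagged with the target API. The retriever (bag-of-words cosine similarity), the prompt wording, the library name `mylib`, and every corpus entry are hypothetical placeholders.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy corpus; "mylib" and all texts are invented for illustration.
CORPUS = [
    {"source": "api_doc", "api": "mylib.normalize",
     "text": "normalize scales an array to unit norm along an axis"},
    {"source": "github_issue", "api": "mylib.normalize",
     "text": "normalize returns nan when the input array is all zeros"},
    {"source": "stackoverflow", "api": "mylib.clip",
     "text": "how to clip an array between a min and max value"},
]

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2, api=None):
    """Return the top-k documents for the query.

    api=None searches the whole corpus (assumed Basic RAG);
    api=<name> keeps only documents tagged with that API first
    (assumed API-level RAG).
    """
    pool = [d for d in docs if api is None or d["api"] == api]
    q = tokenize(query)
    ranked = sorted(pool, key=lambda d: cosine(q, tokenize(d["text"])), reverse=True)
    return ranked[:k]

def build_prompt(api, retrieved=()):
    """Basic instruction prompt, optionally extended with retrieved context."""
    prompt = f"Write a unit test suite for the API `{api}`."
    if retrieved:
        context = "\n".join(f"[{d['source']}] {d['text']}" for d in retrieved)
        prompt += "\nUse the following retrieved context:\n" + context
    return prompt

query = "unit test for mylib.normalize array"
basic_prompt = build_prompt("mylib.normalize")                       # no retrieval
basic_rag_hits = retrieve(query, CORPUS, k=2)                        # whole corpus
api_rag_hits = retrieve(query, CORPUS, k=2, api="mylib.normalize")   # per-API corpus
api_rag_prompt = build_prompt("mylib.normalize", api_rag_hits)
```

In this sketch, an issue-style document contributes an edge case (an all-zero input) that the documentation-style entry does not mention, which is one way to read the abstract's remark that GitHub issues help coverage by supplying edge cases.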

Paper Structure (26 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Overview of the proposed method.
  • Figure 2: The template used for basic instruction prompting.
  • Figure 3: Additional prompt used for augmented generation.
  • Figure 4: The left two sub-figures show the win counts of the RAG approaches vs the basic instruction prompting (BI). Cmb denotes combined RAG, API denotes API documents, GH denotes GitHub issues, and SO denotes StackOverflow Q&As. The right two sub-figures show the win counts within the RAG approaches.
  • Figure 5: An example bug detected by a generated test.
  • ...and 2 more figures