exLong: Generating Exceptional Behavior Tests with Large Language Models

Jiyang Zhang; Yu Liu; Pengyu Nie; Junyi Jessy Li; Milos Gligoric

TL;DR

exLong presents a novel approach to automatically generate exceptional behavior tests (EBTs) by fine-tuning a CodeLlama-based LLM to reason about traces to throw statements, guard expressions, and non-EBTs. It leverages static and dynamic program analysis to collect traces, compute guard constraints, and assemble context-rich prompts for EBT generation, with two use cases: developer-oriented and machine-oriented. Empirical results show exLong outperforms CAT-LM and GPT-3.5 on developer-oriented tasks and surpasses Randoop and EvoSuite on machine-oriented tasks, with GPT-4o-based variants achieving further gains; 23 exLong-generated EBTs were accepted in open-source projects. These findings indicate exLong can significantly improve exceptional behavior testing and offer practical benefits for code quality assurance.

Abstract

Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their efforts on "happy paths", i.e., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed exLong, that automatically generates EBTs. exLong is a large language model instruction fine-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare exLong with the state-of-the-art model for test generation (CAT-LM) and one of the strongest foundation models (GPT-4o), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that exLong outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by exLong were already accepted.
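
To make the abstract's terminology concrete, the following is a minimal sketch of a guard expression, a target throw statement, a non-EBT, and an EBT. It is written in Python for brevity and is not taken from the paper or from exLong's output; the method `withdraw`, the test class `WithdrawTest`, and the helper `run_suite` are hypothetical names introduced only for illustration.

```python
import unittest


def withdraw(balance: float, amount: float) -> float:
    """Hypothetical method under test (an assumption, not from the paper)."""
    # Guard expression: the condition that protects the throw statement.
    if amount <= 0 or amount > balance:
        # Target throw statement that an EBT is meant to reach.
        raise ValueError(f"illegal withdrawal amount: {amount}")
    return balance - amount


class WithdrawTest(unittest.TestCase):
    def test_withdraw_reduces_balance(self):
        # Non-EBT: follows the "happy path", so the guard evaluates to False.
        self.assertEqual(withdraw(100.0, 30.0), 70.0)

    def test_withdraw_rejects_overdraft(self):
        # EBT: chooses inputs that satisfy the guard and expects the exception.
        with self.assertRaises(ValueError):
            withdraw(100.0, 250.0)


def run_suite() -> unittest.TestResult:
    """Run both tests programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WithdrawTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The non-EBT and the EBT call the same method and differ only in whether their inputs satisfy the guard, which is why a non-EBT that executes a similar trace is useful context when writing the EBT.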

Paper Structure (36 sections, 4 figures, 9 tables, 4 algorithms)

Introduction
Use Cases
Developer-oriented use case
Machine-oriented use case
exLong
Training
Identifying EBTs and non-EBTs
Executing EBT and collecting stack trace
Computing the guard expression
Connecting EBTs to relevant non-EBTs
Instruction fine-tuning
Inference
Collecting non-EBTs' stack traces to reach potential target throw statements
Selecting task inputs
Assembling the prompt
...and 21 more sections

Figures (4)

Figure 1: An EBT ('testUnsupportedAtomSpecialChar') from greenmail-mail-test/greenmail and the target throw statement.
Figure 2: Overview of exLong. Two use cases for exLong: (1) developer-oriented use case and (2) machine-oriented use case. In the developer-oriented use case, a developer specifies the method under test, a target throw statement, and a destination test file, and asks exLong to generate an EBT that covers the target throw statement. In the machine-oriented use case, a developer gives an entire repository to exLong.
Figure 3: EBT ('testFileWriterConstructorMissing') generated by GPT-3.5 and exLong. The EBT generated by exLong covers the target throw statement satisfying the correct condition.
Figure 4: Venn diagram showing the target throw statements covered by exLong, Randoop, and EvoSuite.
