Unify and Triumph: Polyglot, Diverse, and Self-Consistent Generation of Unit Tests with LLMs

Djamel Eddine Khelladi; Charly Reux; Mathieu Acher

TL;DR

PolyTest addresses the fragility of single-shot, single-language LLM-based test generation by exploiting the polyglot capabilities and temperature-driven diversity of LLMs. It generates tests across multiple languages at temperature $0$ and across multiple generations within a language at temperature $1$, then unifies these sets to detect and resolve contradictions, improving test quantity, passing rate, coverage, and mutation score without requiring on-the-fly execution. Evaluated on EvalPlus with three LLMs and four target languages plus a CSV format, PolyTest consistently outperforms single-language baselines and Pynguin in key quality metrics, demonstrating robust gains in both coverage-related metrics and mutation-based quality. The approach also enables self-consistency checks by highlighting and filtering contradicting tests across languages and generations. The authors provide a replication package to support reproducibility and discuss implications for applying PolyTest to weaker languages and across multiple LLMs.

Abstract

Large language model (LLM)-based test generation has gained attention in software engineering, yet most studies evaluate LLMs' ability to generate unit tests in a single attempt for a given language, missing the opportunity to leverage LLM diversity for more robust testing. This paper introduces PolyTest, a novel approach that enhances test generation by exploiting polyglot and temperature-controlled diversity. PolyTest systematically leverages these properties in two complementary ways: (1) Cross-lingual test generation, where tests are generated in multiple languages at zero temperature and then unified; (2) Diverse test sampling, where multiple test sets are generated within the same language at a higher temperature before unification. A key insight is that LLMs can generate diverse yet contradicting tests -- same input, different expected outputs -- across languages and generations. PolyTest mitigates inconsistencies by unifying test sets, fostering self-consistency and improving overall test quality. Unlike single-language or single-attempt approaches, PolyTest enhances testing without requiring on-the-fly execution, making it particularly beneficial for weaker-performing languages. We evaluate PolyTest on Llama3-70B, GPT-4o, and GPT-3.5 using EvalPlus, generating tests in five languages (Java, C, Python, JavaScript, and a CSV-based format) at temperature 0 and sampling multiple sets at temperature 1. We observe that LLMs frequently generate contradicting tests across settings, and that PolyTest significantly improves test quality across all considered metrics -- number of tests, passing rate, statement/branch coverage (up to +9.01%), and mutation score (up to +11.23%). Finally, PolyTest outperforms Pynguin in test generation, passing rate, and mutation score.
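
The unification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `unify` function, the `(input, expected)` pair representation, and the sample tests are all hypothetical, and the sketch assumes that contradicting tests (same input, different expected outputs) are simply filtered out rather than resolved in some other way.

```python
from collections import defaultdict


def unify(test_sets):
    """Merge several test sets, each a list of (input, expected) pairs.

    Returns (unified, contradictions):
      - unified maps each input to the single expected output all sources agree on;
      - contradictions maps each disputed input to its conflicting expected outputs.
    Inputs and expected outputs must be hashable (an assumption of this sketch).
    """
    seen = defaultdict(set)
    for tests in test_sets:
        for test_input, expected in tests:
            seen[test_input].add(expected)

    # Keep inputs with exactly one expected output; flag the rest.
    unified = {i: next(iter(e)) for i, e in seen.items() if len(e) == 1}
    contradictions = {i: e for i, e in seen.items() if len(e) > 1}
    return unified, contradictions


# Hypothetical tests for an `add(a, b)` function, as if generated in three languages
# (or in three sampled generations) and normalized to a common representation.
python_tests = [((1, 2), 3), ((0, 0), 0)]
java_tests = [((1, 2), 3), ((-1, 1), 0)]
c_tests = [((0, 0), 1), ((5, 5), 10)]  # disagrees with python_tests on (0, 0)

unified, contradictions = unify([python_tests, java_tests, c_tests])
```

Here the unified set is larger than any single source, and the disputed input `(0, 0)` is detected without executing the function under test, which mirrors the claim that no on-the-fly execution is needed.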
