Reflective Unit Test Generation for Precise Type Error Detection with Large Language Models

Chen Yang; Ziqi Wang; Yanjie Jiang; Lin Yang; Yuteng Zheng; Jianyi Zhou; Junjie Chen

TL;DR

This work tackles Python type errors by combining context-aware type constraint analysis with a reflective validation loop that guides LLM-based unit test generation. The RTED framework propagates type constraints backward along invocation chains, generates bug-revealing tests from context-rich prompts, and uses three reflection agents plus a meta-evaluation step to prune false positives. On BugsInPy and TypeBugs, RTED detects 22-29 more type errors than static and other LLM-based baselines and improves precision by 173.9%-245.9%; it also discovers 12 previously unknown type errors in six real-world open-source projects. The approach is reported to generalize across LLMs and languages.

Abstract

Type errors in Python often lead to runtime failures, posing significant challenges to software reliability and developer productivity. Existing static analysis tools aim to detect such errors without execution but frequently suffer from high false positive rates. Recent unit test generation techniques show great promise in achieving high test coverage, but they often struggle to produce bug-revealing tests without tailored guidance. To address these limitations, we present RTED, a novel type-aware test generation technique for automatically detecting Python type errors. Specifically, RTED combines step-by-step type constraint analysis with reflective validation to guide the test generation process and effectively suppress false positives. We evaluated RTED on two widely used benchmarks, BugsInPy and TypeBugs. Experimental results show that RTED detects 22-29 more benchmarked type errors than four state-of-the-art techniques. RTED also produces fewer false positives, improving precision by 173.9%-245.9%. Furthermore, RTED discovered 12 previously unknown type errors in six real-world open-source Python projects.
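To make the kind of defect concrete, the following is a minimal sketch of a Python type error that only a test built from the calling context would reveal. It is not RTED's implementation; the functions `next_port`, `load_config`, and `reveals_type_error` are hypothetical names introduced purely for illustration.

```python
def next_port(config):
    # Hypothetical function under test; it silently assumes that
    # config["port"] is an int.
    return config["port"] + 1


def load_config(raw):
    # Hypothetical caller: values parsed from text stay as strings,
    # so this invocation chain can pass a str port into next_port.
    return dict(item.split("=", 1) for item in raw.split(";"))


def reveals_type_error(func, *args):
    """Run func(*args) and report whether it raised TypeError."""
    try:
        func(*args)
    except TypeError:
        return True
    return False


# Input reachable through the real invocation chain: exposes the bug.
triggered = reveals_type_error(next_port, load_config("host=localhost;port=8080"))

# Well-typed input built by hand: runs cleanly and hides the bug.
clean = reveals_type_error(next_port, {"port": 8080})
```

In this sketch the hand-built dictionary passes, while the input produced by the caller triggers the `TypeError`. This illustrates why a test generator benefits from knowing which argument types can actually reach a function, and why a reported failure should be checked against inputs that are reachable in practice before it is counted as a true positive.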
