Type-aware LLM-based Regression Test Generation for Python Programs
Runlin Liu, Zhe Zhang, Yunge Hu, Yuhang Lin, Xiang Gao, Hailong Sun
TL;DR
The paper targets the problem of generating high-quality regression tests for Python, a dynamically typed language where type errors undermine test validity. It introduces Test4Py, a four-stage framework that combines call graph-guided parameter-centric summaries, behavior-guided type inference (BGPI), type-aware test case generation, and adaptive error repair to produce executable, semantically meaningful tests. Empirical results on 183 real-world modules show Test4Py achieving an average of 83.0% line coverage and 70.8% branch coverage, outperforming state-of-the-art baselines and exhibiting robustness across different LLMs and type-scarce scenarios. The work also demonstrates the value of interprocedural context and type-informed prompting in improving test quality and fault detection, with ablations confirming the contributions of BGPI and call-graph summaries.
Abstract
Automated regression test generation has been extensively explored, yet generating high-quality tests for Python programs remains particularly challenging. Because of Python's dynamic typing, existing approaches, ranging from search-based software testing (SBST) to recent LLM-driven techniques, are prone to type errors. As a result, they often generate invalid inputs and semantically inconsistent test cases, which ultimately undermines their practical effectiveness. To address these limitations, we present Test4Py, a novel framework that enhances type correctness in automated test generation for Python. Test4Py leverages the program's call graph to capture richer contextual information about parameters, and introduces a behavior-based type inference mechanism that accurately infers parameter types and constructs valid test inputs. Beyond input construction, Test4Py integrates an iterative repair procedure that progressively refines generated test cases to improve coverage. In an evaluation on 183 real-world Python modules, Test4Py achieved an average statement coverage of 83.0% and branch coverage of 70.8%, outperforming state-of-the-art tools by 7.2% and 8.4%, respectively.
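
To give a feel for what inferring a parameter's type from its behavior can mean, the following is a minimal sketch and not Test4Py's actual mechanism, which is LLM-based and uses call-graph context. It is a purely static toy approximation: it collects the attributes accessed on an unannotated parameter and keeps the built-in types that provide all of them. The names `used_attributes`, `infer_candidate_types`, `CANDIDATE_TYPES`, and the `normalize` example function are hypothetical and introduced here only for illustration.

```python
import ast

# Illustrative candidate pool (an assumption, not taken from the paper).
CANDIDATE_TYPES = (str, list, dict, set, int, float)


def used_attributes(source, func_name, param):
    """Collect the attribute names accessed on `param` inside `func_name`."""
    attrs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            for inner in ast.walk(node):
                if (
                    isinstance(inner, ast.Attribute)
                    and isinstance(inner.value, ast.Name)
                    and inner.value.id == param
                ):
                    attrs.add(inner.attr)
    return attrs


def infer_candidate_types(source, func_name, param):
    """Keep the candidate types that offer every attribute used on `param`."""
    attrs = used_attributes(source, func_name, param)
    return [t for t in CANDIDATE_TYPES if all(hasattr(t, a) for a in attrs)]


# Hypothetical function under test: `names` carries no type annotation.
EXAMPLE = '''
def normalize(names):
    names.append("x")
    names.sort()
    return names
'''

candidates = infer_candidate_types(EXAMPLE, "normalize", "names")
print([t.__name__ for t in candidates])
```

Under these assumptions, a test generator would construct a list for `names` rather than, say, a string, which would fail with an `AttributeError` at `names.append`. A real system needs far more context than attribute usage alone, such as how callers construct the argument, which is the kind of information the abstract attributes to the call graph.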
