Understanding the Characteristics of LLM-Generated Property-Based Tests in Exploring Edge Cases
Hidetake Tanaka, Haruto Tanaka, Kazumasa Shimari, Kenichi Matsumoto
TL;DR
This work evaluates edge-case detection in LLM-generated code by comparing Property-based Testing (PBT) and Example-based Testing (EBT), both generated with Claude-4-sonnet, on 16 edge-case problems drawn from HumanEval. It finds that PBT and EBT each detect bugs in $11/16$ cases ($68.75\%$), while a hybrid approach detects bugs in $13/16$ cases ($81.25\%$), demonstrating complementary strengths: PBT excels at exploring large input spaces and exposing potential performance issues, whereas EBT targets explicit boundary conditions and special input patterns. The study provides practical guidelines for test-generation strategies in LLM-based coding, recommending a staged hybrid workflow (PBT first, then EBT) and emphasizing careful test-range design and boundary-condition verification. Overall, the results indicate that combining PBT and EBT can improve the reliability of LLM-generated code in real development settings, and they motivate broader evaluation across more models and datasets.
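As a rough consistency check on these counts, and assuming the hybrid result is simply the union of the cases detected by PBT and by EBT (an assumption made here for illustration, not a breakdown reported above), inclusion-exclusion gives the implied overlap:

$$|\mathrm{PBT} \cap \mathrm{EBT}| = 11 + 11 - 13 = 9, \qquad 11 - 9 = 2 \ \text{cases unique to each method}, \qquad 16 - 13 = 3 \ \text{cases missed by both}.$$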
Abstract
As Large Language Models (LLMs) increasingly generate code in software development, ensuring the quality of LLM-generated code has become important. Traditional testing approaches using Example-based Testing (EBT) often miss edge cases -- defects that occur at boundary values, special input patterns, or extreme conditions. This research investigates the characteristics of LLM-generated Property-based Testing (PBT) compared to EBT for exploring edge cases. We analyze 16 HumanEval problems where standard solutions failed on extended test cases, generating both PBT and EBT test code using Claude-4-sonnet. Our experimental results reveal that while each method individually achieved a 68.75\% bug detection rate, combining both approaches improved detection to 81.25\%. The analysis demonstrates complementary characteristics: PBT effectively detects performance issues and edge cases through extensive input space exploration, while EBT effectively detects specific boundary conditions and special patterns. These findings suggest that a hybrid approach leveraging both testing methods can improve the reliability of LLM-generated code, providing guidance for test generation strategies in LLM-based code generation.
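To make the EBT/PBT distinction concrete, the following is a minimal sketch and not the authors' generated tests. The function `second_largest`, its planted bugs, and the helpers `run_ebt` and `run_pbt` are all hypothetical, and the property-based runner is hand-rolled with the standard library's `random` module rather than a dedicated PBT framework. It shows one boundary bug that only the explicit example catches and one duplicate-handling bug that only the randomized property catches.

```python
import random


def second_largest(nums):
    """Hypothetical buggy function under test.

    Intended behavior: return the second-largest *distinct* value,
    or None when there are fewer than two distinct values.
    """
    # Planted bugs: ignores duplicates and crashes on lists shorter than 2.
    return sorted(nums)[-2]


def run_ebt():
    """Example-based tests: hand-picked inputs with known expected outputs."""
    cases = [([1, 2, 3], 2), ([5, 1], 1), ([], None)]  # [] is a boundary case
    failures = []
    for nums, expected in cases:
        try:
            ok = second_largest(nums) == expected
        except Exception:
            ok = False  # a crash counts as a detected bug
        if not ok:
            failures.append(nums)
    return failures


def run_pbt(trials=200, seed=0):
    """Property-based tests: random inputs checked against a general property.

    Property: a non-None result must be strictly smaller than max(nums).
    The generator range (lengths 2..8, values 0..5) is an assumption; it
    never produces an empty list, so it cannot reach the boundary crash.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        nums = [rng.randint(0, 5) for _ in range(rng.randint(2, 8))]
        try:
            result = second_largest(nums)
            ok = result is None or result < max(nums)
        except Exception:
            ok = False
        if not ok:
            failures.append(nums)
    return failures


if __name__ == "__main__":
    print("EBT failing inputs:", run_ebt())
    print("PBT failing inputs (first 3):", run_pbt()[:3])
```

In this sketch the example-based suite fails only on the empty list, while the property-based run fails only on inputs where the maximum is duplicated, so neither suite alone exposes both defects. The choice of generator range also illustrates why test-range design matters: widening the length range to include 0 would let the randomized run reach the boundary crash as well.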
