Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework
Irtaza Sajid Qureshi, Zhen Ming Jiang
TL;DR
The paper addresses the challenge of evaluating LLM-driven test case generation from bug reports by introducing a dynamic, contamination-aware framework based on Bloom's taxonomy. It systematically probes recall, understanding, application, analysis, evaluation, and (in future work) creation capabilities on Defects4J and GHRB using StarCoder and GPT-4o, incorporating linguistic mutations, identifier mutations, open-book contextual retrieval, and component-level analyses. Key findings show that GPT-4o is generally robust to language shifts, whereas identifier mutations severely degrade performance for both models; open-book prompts and structured code signals significantly boost effectiveness. The study highlights data contamination concerns in benchmarks, proposes robust evaluation practices, and offers concrete directions for improving LLM-based test generation in real-world software engineering. The framework enables a more realistic assessment of LLM reasoning in automated testing, informs benchmark design, and motivates more generalizable approaches and interactive capabilities in future research.
Abstract
Large Language Models (LLMs) are increasingly applied to automated software testing, yet their ability to generalize beyond memorized patterns and reason about natural language bug reports remains unclear. We present a systematic evaluation of LLM reasoning in test case generation, structured around the cognitive layers of Bloom's taxonomy: \textit{Remember}, \textit{Understand}, \textit{Apply}, \textit{Analyze}, \textit{Evaluate}, and \textit{Create}, which progressively assess higher levels of cognitive and reasoning capabilities. Building on the LIBRO framework, we evaluate StarCoder and GPT-4o on Defects4J, GHRB, and mutated variants that introduce linguistic and semantic challenges. Our findings show that both models largely reproduce prior results with minor deviations (\textit{Remember}), exhibit partial robustness to linguistic rephrasings and translations while uncovering unique reproducible bugs (\textit{Understand}), but suffer severe performance drops exceeding 60\% under identifier mutations (\textit{Apply}). Conversely, providing near-identical few-shot examples in an open-book setting improves success rates by up to three times, and component-level analysis reveals that structured technical elements, such as test code and method names, are far more impactful than narrative descriptions for successful test generation (\textit{Analyze}). These insights illuminate the cognitive processes underlying LLM-generated tests, suggest concrete directions for improving performance, and establish a robust and realistic evaluation paradigm for this task.
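To make the idea of an identifier mutation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a mutation consistently replaces known code identifiers in a bug report with opaque placeholders, so that a model cannot rely on memorized names. The function name `mutate_identifiers`, the `id_N` placeholder scheme, and the sample report text are all hypothetical.

```python
import re


def mutate_identifiers(report: str, identifiers: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed identifier with an opaque placeholder.

    Hypothetical helper: the paper does not specify its mutation procedure.
    Returns the mutated report and the identifier-to-placeholder mapping.
    """
    # Assign placeholders in list order so the mapping is deterministic.
    mapping = {name: f"id_{i}" for i, name in enumerate(identifiers)}
    if not mapping:
        return report, mapping

    # Match whole identifiers only; try longer names first.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    # Every occurrence of a name maps to the same placeholder.
    mutated = pattern.sub(lambda m: mapping[m.group(1)], report)
    return mutated, mapping


# Hypothetical bug report text, invented for illustration only.
report = "Parser.parseDate throws on empty input; call parseDate(\"\") to reproduce."
mutated, mapping = mutate_identifiers(report, ["Parser", "parseDate"])
print(mutated)
```

Because the mapping is returned alongside the mutated text, the same renaming could in principle be applied to the code under test so that a generated test still compiles; whether the study does this is not stated in the abstract.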
