Technique to Baseline QE Artefact Generation Aligned to Quality Metrics

Eitan Farchi, Kiran Nayak, Papia Ghosh Majumdar, Saritha Route

TL;DR

This paper tackles quality assurance for QE artefacts generated by Large Language Models by introducing a closed-loop technique that combines initial artefact generation, reverse generation to test semantic preservation, and iterative refinement guided by a quantitative quality rubric and SBERT-based semantic similarity. Artefacts are scored on four metrics (Clarity, Completeness, Consistency, Testability) using a 5-point scale, and semantic alignment is assessed with cosine similarity thresholds (High $\geq 0.8$, Medium $0.6$–$0.8$, Low $0.3$–$0.6$, No Match $<0.3$), complemented by Jaccard similarity and entity/verb extraction. Experimental results across 12 projects and 150+ artefact pairs show that reverse-generated artefacts can substantially improve on low-quality inputs and that BDD-derived requirements yield higher testability and semantic fidelity than test-derived ones, while high-quality inputs maintain their advantage. The approach reduces manual review by 60–70% and promotes sustainability through lightweight embedding reuse, offering a scalable, auditable framework for responsible LLM adoption in QE workflows. Overall, the method provides a replicable blueprint for balancing automation with accountability in artefact generation and validation, with practical implications for enterprise Agile/DevOps contexts.
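The similarity bands quoted above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the vectors are assumed to be SBERT embeddings computed elsewhere, the Jaccard measure is taken over whitespace tokens, and treating the lower edge of each band as inclusive is an assumption, since the summary gives only the ranges.

```python
import math
from typing import Sequence


def similarity_band(score: float) -> str:
    """Map a cosine similarity score to the bands quoted in the summary.

    Assumption: the lower edge of each band is inclusive (0.6 is Medium,
    0.3 is Low); the source only states the ranges.
    """
    if score >= 0.8:
        return "High"
    if score >= 0.6:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "No Match"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (e.g. sentence embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (a simplifying assumption)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

In practice the two vectors would be the embeddings of the original artefact and of its reverse-generated counterpart, and the band would decide whether the pair needs manual review.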

Abstract

Large Language Models (LLMs) are transforming Quality Engineering (QE) by automating the generation of artefacts such as requirements, test cases, and Behavior Driven Development (BDD) scenarios. However, ensuring the quality of these outputs remains a challenge. This paper presents a systematic technique to baseline and evaluate QE artefacts using quantifiable metrics. The approach combines LLM-driven generation, reverse generation, and iterative refinement guided by a rubric covering clarity, completeness, consistency, and testability. Experimental results across 12 projects show that reverse-generated artefacts can outperform low-quality inputs and maintain high standards when inputs are strong. The framework enables scalable, reliable QE artefact validation, bridging automation with accountability.
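The generate, reverse-generate, refine cycle described in the abstract might be organised roughly as in the sketch below. Everything in it is an assumption made for illustration: the function name, the callables standing in for LLM calls and for the similarity measure, the 0.8 stopping threshold, and the round limit are all hypothetical, and the rubric scores are assumed to be applied inside the `refine` step rather than modelled explicitly.

```python
from typing import Callable, Tuple


def refine_until_aligned(
    source: str,
    generate: Callable[[str], str],          # hypothetical: source -> artefact
    reverse_generate: Callable[[str], str],  # hypothetical: artefact -> reconstructed source
    similarity: Callable[[str, str], float], # hypothetical: semantic similarity in [0, 1]
    refine: Callable[[str, str], str],       # hypothetical: (source, artefact) -> improved artefact
    threshold: float = 0.8,                  # assumed stopping point
    max_rounds: int = 3,                     # assumed iteration budget
) -> Tuple[str, float, int]:
    """Return (artefact, similarity score, rounds used)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    artefact = generate(source)
    for round_no in range(1, max_rounds + 1):
        # Reverse generation: rebuild the source from the artefact and
        # check how much of the original meaning survived.
        reconstructed = reverse_generate(artefact)
        score = similarity(source, reconstructed)
        if score >= threshold or round_no == max_rounds:
            return artefact, score, round_no
        # Below threshold: ask for a refined artefact and try again.
        artefact = refine(source, artefact)

    raise AssertionError("unreachable: the loop always returns")
```

Passing the steps in as callables keeps the loop independent of any particular model or embedding library, so the same control flow can be exercised with simple stubs.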

Paper Structure

This paper contains 22 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Technique overview diagram