Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications

Alexandre Cristovão Maiorano

Abstract

LLM applications are AI systems whose non-deterministic outputs and evolving model behavior make traditional testing insufficient for release governance. We present an automated self-testing framework that introduces quality gates with evidence-based release decisions (PROMOTE/HOLD/ROLLBACK) across five empirically grounded dimensions: task success rate, research context preservation, P95 latency, safety pass rate, and evidence coverage. We evaluate the framework through a longitudinal case study of an internally deployed multi-agent conversational AI system with specific marketing capabilities in active development, covering 38 evaluation runs across 20+ internal releases. The gate identified two ROLLBACK-grade builds in early runs and supported stable quality evolution over a four-week staging lifecycle while exercising persona-grounded, multi-turn, adversarial, and evidence-required scenarios. Statistical analysis (Mann-Kendall trends, Spearman correlations, bootstrap confidence intervals), gate ablation, and overhead scaling indicate that evidence coverage is the primary severe-regression discriminator and that runtime scales predictably with suite size. A human calibration study (n=60 stratified cases, two independent evaluators, LLM-as-judge cross-validation) reveals complementary multi-modal coverage: LLM-judge disagreements with the system gate (kappa=0.13) are attributable to structural failure modes such as latency violations and routing errors that are invisible in response text alone, while the judge independently surfaces content quality failures missed by structural checks, validating the multi-dimensional gate design. The framework, supplementary pseudocode, and calibration artifacts are provided to support AI-system quality assurance and independent replication.
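
The three-way decision over five dimensions described above can be illustrated with a minimal sketch. It is not the authors' implementation. The `Threshold` class, the `THRESHOLDS` table, the rule "any severe breach means ROLLBACK, any ordinary miss means HOLD", and every numeric value are assumptions made for illustration; only the 80% acceptance level for task success rate is taken from the paper (Figure 4 caption).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "PROMOTE"
    HOLD = "HOLD"
    ROLLBACK = "ROLLBACK"


@dataclass(frozen=True)
class Threshold:
    """Hypothetical two-level threshold for one quality dimension."""
    accept: float                  # value needed to pass the dimension
    severe: float                  # value beyond which the miss is severe
    higher_is_better: bool = True

    def status(self, value: float) -> str:
        """Classify a metric value as 'ok', 'miss', or 'severe'."""
        if self.higher_is_better:
            if value >= self.accept:
                return "ok"
            return "severe" if value < self.severe else "miss"
        if value <= self.accept:
            return "ok"
        return "severe" if value > self.severe else "miss"


# Assumed values, except the 0.80 task-success acceptance level (Figure 4).
THRESHOLDS = {
    "task_success_rate": Threshold(accept=0.80, severe=0.50),
    "context_preservation": Threshold(accept=0.80, severe=0.50),
    "p95_latency_s": Threshold(accept=10.0, severe=30.0, higher_is_better=False),
    "safety_pass_rate": Threshold(accept=0.95, severe=0.80),
    "evidence_coverage": Threshold(accept=0.80, severe=0.50),
}


def gate_decision(metrics: dict[str, float]) -> Decision:
    """Deterministically map one run's five metrics to a release decision."""
    statuses = [THRESHOLDS[name].status(metrics[name]) for name in THRESHOLDS]
    if "severe" in statuses:
        return Decision.ROLLBACK
    if "miss" in statuses:
        return Decision.HOLD
    return Decision.PROMOTE


healthy_run = {
    "task_success_rate": 0.92,
    "context_preservation": 0.88,
    "p95_latency_s": 6.5,
    "safety_pass_rate": 1.0,
    "evidence_coverage": 0.90,
}
print(gate_decision(healthy_run).value)
```

Keeping the decision a pure function of the five metrics makes each release verdict reproducible from the stored run data, which is the property an evidence-based gate relies on.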

Paper Structure (56 sections, 6 figures, 12 tables, 2 algorithms)

Figures (6)

  • Figure 1: High-level interaction between the static/dynamic Question Bank and the multi-agent LLM orchestrator during evaluation.
  • Figure 2: CI/CD integration pipeline. A merge to the main branch triggers automated build checkout, question bank loading, full test suite execution with OpenTelemetry trace collection, five-dimensional metric computation, and deterministic gate decision. HOLD and ROLLBACK events feed back into the question bank expansion loop, preventing test suite drift.
  • Figure 3: Decision flowchart: PROMOTE, HOLD, or ROLLBACK based on five quality dimensions.
  • Figure 4: Success rate across 38 evaluation runs. Green markers indicate PROMOTE decisions; red markers indicate ROLLBACK. The dashed line marks the 80% acceptance threshold.
  • Figure 5: Distribution of P95 latency by suite phase labels (D13, C59, C86-88, C106, C133). The increasing trend ($\tau = 0.374$) correlates with growing test-suite complexity.
  • ...and 1 more figure
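
The trend value $\tau$ in the Figure 5 caption comes from a Mann-Kendall analysis. As a rough sketch of what such a statistic measures, the function below computes the Mann-Kendall $S$ statistic and the corresponding Kendall $\tau$ for a series ordered by run. The name `mann_kendall_tau` is hypothetical, and the formula is the simple tau-a variant with no tie correction; the paper does not state which variant it uses, so this is an assumption.

```python
from itertools import combinations


def mann_kendall_tau(series: list[float]) -> tuple[int, float]:
    """Return (S, tau) for a time-ordered series.

    S counts later-minus-earlier sign agreements over all pairs;
    tau divides S by the number of pairs (tau-a, no tie correction).
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    # combinations() yields pairs in index order, so `earlier` precedes `later`.
    s = sum(
        (later > earlier) - (later < earlier)
        for earlier, later in combinations(series, 2)
    )
    n_pairs = n * (n - 1) // 2
    return s, s / n_pairs


# Illustrative latencies only, not data from the paper.
s_stat, tau = mann_kendall_tau([1.0, 3.0, 2.0, 4.0])
print(s_stat, round(tau, 3))
```

A positive $\tau$ means later runs tend to have higher values than earlier ones, and a value near zero means no monotonic trend.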