Human-Calibrated Automated Testing and Validation of Generative Language Models

Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava

TL;DR

This paper tackles the problem of validating generative language models in high-stakes domains by leveraging the bounded grounding of Retrieval-Augmented Generation. It introduces the Human-Calibrated Automated Testing (HCAT) framework, which combines automatic test generation via topic-modeling-based stratified sampling, embedding-based functionality and risk metrics, and a two-stage human calibration (probability calibration and conformal prediction) to align automated scores with human judgments. Functionality evaluation is grounded in metrics for relevance, groundedness, completeness, and answer relevancy, with complementary use of sentence-level similarity, natural language inference, and Wasserstein distance to capture both local and distribution-level aspects. The framework also emphasizes robustness checks and targeted weakness identification, enabling scalable, transparent, and regulatory-friendly GLM validation for RAG systems in banking, ultimately supporting safer deployment and ongoing risk mitigation.

Abstract

This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.
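The two-stage calibration named above (probability calibration followed by conformal prediction) can be illustrated with a minimal sketch. It assumes a single machine score per response and a binary human label, uses a one-feature logistic regression for the probability step, and uses split conformal prediction for the second step. The data are synthetic and every function and variable name is hypothetical; the paper's actual procedure may differ in its details.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(scores, labels, lr=0.5, epochs=2000):
    # Stage 1 (assumed form): map a machine score s to P(human label = 1)
    # with a one-feature logistic regression fitted by gradient descent.
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def conformal_threshold(probs, labels, alpha=0.1):
    # Stage 2: split conformal prediction on a held-out calibration set.
    # Nonconformity score = 1 - calibrated probability of the true label.
    scores = sorted(1.0 - (p if y == 1 else 1.0 - p) for p, y in zip(probs, labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return 1.0  # too few calibration points: every label is admitted
    return scores[k - 1]


def prediction_set(p, qhat):
    # Keep each label whose nonconformity score does not exceed the threshold.
    return [label for label, pl in ((0, 1.0 - p), (1, p)) if 1.0 - pl <= qhat]


# Synthetic stand-in for (machine score, human judgment) pairs.
random.seed(0)
machine = [random.random() for _ in range(200)]
human = [1 if random.random() < sigmoid(6.0 * (s - 0.5)) else 0 for s in machine]

# Fit the probability calibrator on one half, conformalize on the other half.
a, b = fit_logistic(machine[:100], human[:100])
cal_probs = [sigmoid(a * s + b) for s in machine[100:]]
qhat = conformal_threshold(cal_probs, human[100:], alpha=0.1)

new_prob = sigmoid(a * 0.8 + b)
print("calibrated probability:", new_prob)
print("prediction set:", prediction_set(new_prob, qhat))
```

A prediction set containing both labels flags a response for which the machine score cannot settle the human judgment at the chosen error level `alpha`, so it is a natural candidate for manual review. The coverage guarantee of split conformal prediction is marginal and relies on the calibration and test pairs being exchangeable.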

Paper Structure

This paper contains 23 sections, 21 equations, 7 figures.

Figures (7)

  • Figure 1: Topic modeling through dimensionality reduction, clustering, and topic extraction.
  • Figure 2: RAG System Components and Functionality Evaluation
  • Figure 3: Calibration Diagram of Machine and Human Evaluations
  • Figure 4: An illustration of calibration for machine-human groundedness evaluation, using logistic regression and conformal prediction.
  • Figure 5: Marginal (topic) weakness analysis: recall & precision
  • ...and 2 more figures