LeCov: Multi-level Testing Criteria for Large Language Models

Xuan Xie; Jiayang Song; Yuheng Huang; Da Song; Fuyuan Zhang; Felix Juefei-Xu; Lei Ma

TL;DR

LeCov introduces a formal, multi-level testing framework for LLMs that targets three internal components (attention, feed-forward neurons, and uncertainty) and defines nine criteria across attention-wise, neuron-wise, and uncertainty-wise coverage. The criteria are applied to test prioritization and coverage-guided testing (CGT), and are evaluated on three open-source models and four datasets, showing improvements in both prioritization accuracy and defect detection. Key contributions include the design of $k$-multisection attention and uncertainty coverages, time-aware neuron criteria, and a mutation-driven CGT pipeline. The findings suggest that testing informed by a model's internal structure can strengthen LLM trustworthiness assessments and guide practical testing workflows.

Abstract

Large Language Models (LLMs) are widely used across many domains, but their limited interpretability raises questions about their trustworthiness from several perspectives, e.g., truthfulness and toxicity. Recent research has begun developing testing methods for LLMs that aim to uncover trustworthiness issues, i.e., defects, before deployment. However, systematic and formalized testing criteria are lacking, which hinders a comprehensive assessment of the extent and adequacy of testing. To address this gap, we propose LeCov, a set of multi-level testing criteria for LLMs. The criteria consider three crucial internal components of LLMs, i.e., the attention mechanism, feed-forward neurons, and uncertainty, and comprise nine types of testing criteria in total. We apply the criteria in two scenarios: test prioritization and coverage-guided testing. The experimental evaluation, on three models and four datasets, demonstrates the usefulness and effectiveness of LeCov.
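
To make the idea of a $k$-multisection criterion concrete, the sketch below shows one generic way such a coverage can be computed over a scalar signal, for example a per-token uncertainty score. It is only an illustration under assumptions: the function names, the fixed value range `[low, high]`, and the equal-width sectioning are hypothetical choices made here, and the paper's own definitions for attention and uncertainty coverage may differ.

```python
from typing import Iterable, Sequence, Set


def covered_sections(values: Sequence[float], low: float, high: float, k: int) -> Set[int]:
    """Return the indices of the k equal-width sections of [low, high] hit by `values`.

    Assumption: the range [low, high] is fixed in advance (e.g. profiled on
    reference data); values outside the range are ignored.
    """
    if k <= 0 or high <= low:
        raise ValueError("need k > 0 and high > low")
    width = (high - low) / k
    hit: Set[int] = set()
    for v in values:
        if low <= v <= high:
            # Clamp so that v == high falls into the last section.
            hit.add(min(int((v - low) / width), k - 1))
    return hit


def multisection_coverage(per_input_values: Iterable[Sequence[float]],
                          low: float, high: float, k: int) -> float:
    """Fraction of the k sections hit by at least one value across all test inputs."""
    hit: Set[int] = set()
    for values in per_input_values:
        hit |= covered_sections(values, low, high, k)
    return len(hit) / k


# Hypothetical per-token uncertainty scores for two test inputs.
scores = [[0.05, 0.12, 0.95], [0.30, 0.40]]
coverage = multisection_coverage(scores, low=0.0, high=1.0, k=4)
print(coverage)
```

A test suite that pushes the signal into more sections scores higher, which is the property that both test prioritization and coverage-guided testing can build on.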

Paper Structure (17 sections, 8 equations, 2 figures, 2 tables, 1 algorithm)

Introduction
Background
LLM Defects
Deep Learning System Testing
Testing Criteria for LLM
Attention-wise Coverage Criteria
Neuron-wise Coverage Criteria
Uncertainty-wise Coverage Criteria
Application
Experiment
Experimental Setting
Experimental Evaluation
RQ1: Can the proposed testing criteria approximate the functional feature of LLMs?
RQ2: How effective are the criteria in conducting test prioritization?
RQ3: Are the proposed criteria effective in guiding the testing procedure to find LLM defects?
...and 2 more sections

Figures (2)

Figure 1: Workflow of LeCov.
Figure 2: Test success rate (TSR) of coverage-guided testing. The x-axis is the TSR, and the y-axis is the testing method.
