TEL'M: Test and Evaluation of Language Models

George Cybenko, Joshua Ackerman, Paul Lintilhac

TL;DR

TEL'M proposes a five-ingredient framework for principled LM evaluation that emphasizes task concreteness, observable properties, and rigorous measurement design. It introduces a formal metric architecture with simple, compound, and higher-order property metrics, underpinned by probabilistic guarantees, and discusses design and execution practices that enable reproducible, end-use-driven assessments. The approach aims to move LM evaluation toward rigorous, industrial-grade testing akin to that used in healthcare and defense, with a parity case study illustrating practical reporting. Collectively, TEL'M offers a structured path to quantify LM capabilities and limitations for high-value commercial, government, and national security applications, enabling more reliable, transferable insights across evolving AI systems.

Abstract

Language Models have demonstrated remarkable capabilities on some tasks while failing dramatically on others. The situation has generated considerable interest in understanding and comparing the capabilities of various Language Models (LMs) but those efforts have been largely ad hoc with results that are often little more than anecdotal. This is in stark contrast with testing and evaluation processes used in healthcare, radar signal processing, and other defense areas. In this paper, we describe Test and Evaluation of Language Models (TEL'M) as a principled approach for assessing the value of current and future LMs focused on high-value commercial, government and national security applications. We believe that this methodology could be applied to other Artificial Intelligence (AI) technologies as part of the larger goal of "industrializing" AI.

Paper Structure (14 sections, 9 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Many existing language model evaluations fail to follow or document these simple steps.
  • Figure 2: Hoeffding's Inequality [hoeffding1994probability] can be used to bound the number of samples needed to obtain confidence $1-\alpha$ that the probability of a property, $P$, lies in the interval $[\mu-\epsilon,\mu+\epsilon]$, as described in the Confidence Intervals (CI) box. For example, to get an estimate of that probability to within $\epsilon=0.1$ with confidence $1-\alpha = 0.95$, we need at least 738 samples. Constraints and more sophisticated models for the property could significantly reduce that number.
  • Figure 3: Ideally, a property metric would provide some insight into how far the LM under test is from having the property. For example, saying that a system is 90% accurate should mean that 10% of the responses would have to be changed to make the system 100% accurate. As discussed in the text, this can be translated rigorously into the statement that the system is a distance of 0.1 away from the property. However, the notion of "distance" from an LM, ${\cal L}$, to the property under consideration involves several nuances, as depicted in this figure. For example, the "closest" system with the property may not be in the architecture class of language models to which the LM under test belongs (such as transformers with the same specific sizes and configurations of attention and feedforward layers as the LM under test, not just the generic class of all transformers). Determining whether any model in that same architecture class has the property is typically a difficult, theory-oriented question about the representational power and trainability of language model architectures.
  • Figure 4: Simple task property metrics can be determined from independent samples of an LM's prompt-response behavior that are aggregated by averaging. By contrast, compound task property metrics, such as monotonicity, require multiple prompt-response pairs that are not simply averaged, although subsequent processing of averages may be needed to make inferences about a compound property. Another dimension along which task property metrics vary is their "order". A first-order property can be determined prima facie from the prompt-response pairs by averaging. Second- and higher-order properties, on the other hand, defer to other applications (for example, other LMs, parsers, compilers, optimization packages, or execution environments) to estimate metric values. Those other applications themselves have to be tested and evaluated with metrics that can be simple or compound, as this figure depicts. While first-order property metrics are either simple or compound, higher-order property metrics can be combinations of both.
  • Figure 5: This illustrates the concepts and derivations surrounding the possible values of the actual accuracy, $r$, of an LM in a higher-order testing scenario when the data being tested against has true accuracy $q$. In this context, when both the human and the model, which is scored against the human, are "wrong", the model response is actually correct. For example, if $p=0.9$ is the estimated model accuracy measured against data that has accuracy $q=0.95$, then the true value, $r$, will satisfy $0.9+0.95-1 = 0.85 \leq r \leq 0.9 + (1-0.95) = 0.95$. Under the independence assumption for model and human errors, $r=0.9\cdot 0.95=0.855$, which is almost at the lower bound for $r$.
  • ...and 4 more figures
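
The arithmetic in the Figure 2 and Figure 5 captions can be checked with a short sketch. The function names `hoeffding_samples` and `accuracy_bounds` are hypothetical and not taken from the paper. The sample-size function assumes the common two-sided form of Hoeffding's Inequality for values in $[0,1]$, $2e^{-2n\epsilon^2} \leq \alpha$. Under that assumption, the caption's figure of 738 is reproduced with a half-width of 0.05 (a total interval width of 0.1), or equivalently by $n \geq 2\ln(2/\alpha)/\epsilon^2$ with $\epsilon = 0.1$. Which convention the paper itself uses is not visible in the caption, so treat this as one plausible reading.

```python
import math
from typing import Tuple


def hoeffding_samples(epsilon: float, alpha: float) -> int:
    """Smallest n satisfying 2 * exp(-2 * n * epsilon**2) <= alpha.

    Assumes the two-sided Hoeffding bound for i.i.d. samples in [0, 1];
    epsilon is the half-width of the interval around the sample mean.
    """
    return math.ceil(math.log(2.0 / alpha) / (2.0 * epsilon ** 2))


def accuracy_bounds(p: float, q: float) -> Tuple[float, float]:
    """Bounds on the true accuracy r as written in the Figure 5 caption.

    p: model accuracy measured against the reference data.
    q: true accuracy of the reference data.
    Returns (p + q - 1, p + (1 - q)), clipped to the range [0, 1].
    """
    lower = max(0.0, p + q - 1.0)
    upper = min(1.0, p + (1.0 - q))
    return lower, upper


if __name__ == "__main__":
    # Half-width 0.1 under the assumed standard form of the bound.
    print(hoeffding_samples(0.1, 0.05))   # 185
    # Half-width 0.05 reproduces the caption's sample count.
    print(hoeffding_samples(0.05, 0.05))  # 738
    low, high = accuracy_bounds(0.9, 0.95)
    print(f"{low:.2f} <= r <= {high:.2f}")  # 0.85 <= r <= 0.95
```

Because the required sample count scales as $1/\epsilon^2$, halving the tolerance quadruples the number of samples, which is why the choice of convention for $\epsilon$ matters when budgeting an evaluation.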