TEL'M: Test and Evaluation of Language Models
George Cybenko, Joshua Ackerman, Paul Lintilhac
TL;DR
TEL'M proposes a five-ingredient framework for principled LM evaluation that emphasizes task concreteness, observable properties, and rigorous measurement design. It introduces a formal metric architecture with simple, compound, and higher-order property metrics, underpinned by probabilistic guarantees, and discusses design and execution practices that enable reproducible, end-use-driven assessments. The approach aims to move LM evaluation toward rigorous, industrial-grade testing akin to the practices used in healthcare and defense, with a parity case study illustrating practical reporting. Collectively, TEL'M offers a structured path to quantifying LM capabilities and limitations for high-value commercial, government, and national security applications, yielding insights that are more reliable and that transfer across evolving AI systems.
Abstract
Language Models (LMs) have demonstrated remarkable capabilities on some tasks while failing dramatically on others. This situation has generated considerable interest in understanding and comparing the capabilities of various LMs, but those efforts have been largely ad hoc, with results that are often little more than anecdotal. This is in stark contrast with the testing and evaluation processes used in healthcare, radar signal processing, and other defense areas. In this paper, we describe Test and Evaluation of Language Models (TEL'M) as a principled approach for assessing the value of current and future LMs, with a focus on high-value commercial, government, and national security applications. We believe that this methodology could be applied to other Artificial Intelligence (AI) technologies as part of the larger goal of "industrializing" AI.
