Evaluating Large Language Models with fmeval
Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, Keerthan Vasist, Aman Malhotra, Tomer Shenhar, Pinal Tailor, Pinar Yilmaz, Michael Diamond, Michele Donini
TL;DR
This paper presents fmeval, an open-source library for evaluating large language models on task performance and along responsible AI dimensions, built around four design principles: simplicity, coverage, extensibility, and performance. The architecture separates data, model, evaluation, and reporting components, supports distributed evaluation via Ray, and integrates with AWS for MLOps workflows. Built-in evaluations cover classification, summarization, question answering (QA), and factual knowledge, plus bias, toxicity, and robustness checks; users can also bring their own datasets and define custom evaluations. A case study uses fmeval to compare QA models, including open-book variants, and shows how multiple metrics reveal trade-offs between accuracy, robustness, and safety. The paper closes with limitations and directions for future work: broader coverage, multilingual support, guardrailing, and RAG-oriented evaluation.
Abstract
fmeval is an open-source library for evaluating large language models (LLMs) on a range of tasks. It helps practitioners evaluate their models for task performance and along multiple responsible AI dimensions. This paper presents the library and lays out its underlying design principles: simplicity, coverage, extensibility, and performance. We then describe how these principles shaped the scientific and engineering choices made when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.
