Evaluating Large Language Models with fmeval

Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, Keerthan Vasist, Aman Malhotra, Tomer Shenhar, Pinal Tailor, Pinar Yilmaz, Michael Diamond, Michele Donini

TL;DR

fmeval is an open-source framework for evaluating large language models for task performance and along responsible AI dimensions, built around four design principles: Simplicity, Coverage, Extensibility, and Performance. The architecture separates data, model, evaluation, and reporting components, supports distributed evaluation via Ray, and integrates with AWS for streamlined MLOps. Built-in evaluations span classification, summarization, QA, and factual knowledge, plus bias, toxicity, and robustness checks; bring-your-own (BYO) datasets and custom evaluations are also supported. A case study uses fmeval to compare QA models, including open-book variants, and shows how multiple metrics reveal trade-offs between accuracy, robustness, and safety. The work concludes with limitations and directions for future iterations: expanded coverage, multilingual support, guardrailing, and RAG-oriented evaluation.

Abstract

fmeval is an open-source library for evaluating large language models (LLMs) on a range of tasks. It helps practitioners evaluate their models for task performance and along multiple responsible AI dimensions. This paper presents the library and lays out its underlying design principles: simplicity, coverage, extensibility, and performance. We then describe how these principles shaped the scientific and engineering choices made when developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and further work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.

Paper Structure (75 sections, 2 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: High-level component interaction in FMEval. The user creates a ModelRunner and a DataConfig, and passes them to an implementation of EvalAlgorithmInterface. The evaluation algorithm loads data based on the DataConfig, executes the algorithm, and returns the result as an EvalOutput object. This can be visualized using the reporting module.
  • Figure 2: High-level system architecture of Amazon SageMaker FM Evaluations. A dataset and a configuration file serve as input to a batch processing job that produces evaluation results. These outputs are stored in a filesystem, and visualized in Amazon SageMaker Studio.
  • Figure 3: Built-in metrics for task accuracy and robustness in the QA evaluation (see § \ref{sec:qa}), on average over the built-in datasets. Toxicity is reported on the RealToxicityPrompts-Challenging subset (see § \ref{sec:toxicity_datasets}). See § \ref{sec:detailed_results} for per-dataset results.
  • Figure 4: Open-book QA Accuracy, higher is better ($\uparrow$).
  • Figure 5: QA Accuracy results on the three built-in datasets.
  • ...and 6 more figures
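
The component interaction described in the Figure 1 caption can be illustrated with a minimal, self-contained Python sketch. It borrows the names ModelRunner, DataConfig, and EvalOutput from the caption, but every class, field, and method signature below is a hypothetical stand-in written for illustration and should not be read as fmeval's actual API. The exact-match metric and the toy lookup-table model are likewise assumptions, chosen only to make the flow runnable.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class DataConfig:
    """Hypothetical stand-in: names a dataset and carries its records."""
    dataset_name: str
    records: list[dict[str, str]]  # keys: "model_input", "target_output"


class ModelRunner(Protocol):
    """Hypothetical stand-in: anything that maps a prompt to a text output."""

    def predict(self, prompt: str) -> str: ...


@dataclass
class EvalOutput:
    """Hypothetical stand-in for the result object returned by an evaluation."""
    eval_name: str
    dataset_name: str
    score: float


class ExactMatchEval:
    """Toy evaluation algorithm (an assumption, not a built-in fmeval metric):
    the fraction of model outputs equal to the target, ignoring case."""

    eval_name = "exact_match"

    def evaluate(self, model: ModelRunner, data_config: DataConfig) -> EvalOutput:
        # "Load" the data from the config, run the model, and score each record.
        hits = sum(
            model.predict(r["model_input"]).strip().lower()
            == r["target_output"].strip().lower()
            for r in data_config.records
        )
        score = hits / len(data_config.records)
        return EvalOutput(self.eval_name, data_config.dataset_name, score)


class LookupRunner:
    """Toy model that answers from a fixed table (one answer deliberately wrong)."""

    _answers = {"Capital of France?": "Paris", "Capital of Italy?": "Milan"}

    def predict(self, prompt: str) -> str:
        return self._answers.get(prompt, "")


# The user creates a model runner and a data config, then passes both
# to an evaluation algorithm, which returns a result object.
config = DataConfig(
    dataset_name="toy_qa",
    records=[
        {"model_input": "Capital of France?", "target_output": "Paris"},
        {"model_input": "Capital of Italy?", "target_output": "Rome"},
    ],
)
result = ExactMatchEval().evaluate(LookupRunner(), config)
print(result)
```

Keeping the model behind a single predict method and the dataset behind a config object is what allows one evaluation to be reused across models and datasets in this sketch; the real library's interfaces should be taken from its repository.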