AI Benchmarks and Datasets for LLM Evaluation

Todor Ivanov; Valeri Penchev

TL;DR

The paper tackles the challenge of evaluating and improving the trustworthiness of large language models (LLMs) in the context of evolving EU regulation. It foregrounds Z-Inspection, the EU AI Act, and the COMPL-AI framework as complementary tools for rigorous, regulation-aligned benchmarking, and introduces a Bulgarian benchmarking initiative to collect and categorize AI benchmarks. It surveys a broad suite of benchmarks and datasets spanning knowledge, reasoning, safety, and multimodal tasks, illustrating that no current LLM fully satisfies the Act’s requirements. The work highlights the practical impact of standardized benchmarks for regulatory compliance, system safety, and trustworthiness, providing a foundation for researchers and practitioners to assess and improve LLMs across the AI lifecycle.

Abstract

Large language models (LLMs) demand significant computational resources for both pre-training and fine-tuning, and their size makes distributed computing a requirement \cite{sastry2024computing}. Their complex architecture poses challenges throughout the entire AI lifecycle, from data collection to deployment and monitoring \cite{OECD_AIlifecycle}. Addressing critical AI system challenges, such as explainability, corrigibility, interpretability, and hallucination, requires a systematic methodology and rigorous benchmarking \cite{guldimann2024complai}. To improve AI systems effectively, we must first identify their systemic vulnerabilities precisely through quantitative evaluation, which in turn strengthens system trustworthiness. The EU AI Act \cite{EUAIAct}, adopted by the European Parliament on March 13, 2024, establishes the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems, and further underscores the importance of tools and methodologies such as Z-Inspection. The Act also highlights the need to enrich this methodology with practical benchmarks that address the technical challenges posed by AI systems. To this end, we have launched a project, part of the AI Safety Bulgaria initiatives \cite{AI_Safety_Bulgaria}, that collects and categorizes AI benchmarks so that practitioners can identify and use them throughout the AI system lifecycle.

Paper Structure

Table of Contents