LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

Fahim Dalvi; Maram Hasanain; Sabri Boughorbel; Basel Mousi; Samir Abdaljalil; Nizi Nazar; Ahmed Abdelali; Shammur Absar Chowdhury; Hamdy Mubarak; Ahmed Ali; Majd Hawasly; Nadir Durrani; Firoj Alam

TL;DR

LLMeBench is a framework that can be customized to evaluate LLMs on any NLP task, regardless of language. It provides generic dataset loaders, several model providers, and pre-implemented versions of most standard evaluation metrics.

Abstract

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, customizing them for specific tasks and datasets is often complex for users. In this study, we introduce the LLMeBench framework, which can be seamlessly customized to evaluate LLMs for any NLP task, regardless of language. The framework provides generic dataset loaders, several model providers, and pre-implemented versions of most standard evaluation metrics. It supports in-context learning in zero- and few-shot settings. A specific dataset and task can be evaluated for a given LLM in fewer than 20 lines of code, while retaining full flexibility to extend the framework for custom datasets, models, or tasks. The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We have open-sourced LLMeBench for the community (https://github.com/qcri/LLMeBench/), and a video demonstrating the framework is available online (https://youtu.be/9cC2m_abk3A).
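To make the modular design described in the abstract more concrete, the following is a minimal, self-contained sketch of how a dataset loader, a model provider, a prompt builder (zero- or few-shot), post-processing, and an evaluation metric might fit together. It does not use the LLMeBench package: every function name, the toy sentiment data, and the keyword-based stand-in for an LLM are assumptions made purely for illustration, and the actual framework API may differ.

```python
from typing import Callable, Dict, List, Sequence

# NOTE: all names here are hypothetical and are NOT the LLMeBench API.
Sample = Dict[str, str]


def load_dataset() -> List[Sample]:
    """Dataset module stand-in: toy samples with an input and a gold label."""
    return [
        {"input": "I loved this movie", "label": "positive"},
        {"input": "I hated this movie", "label": "negative"},
        {"input": "What a great day", "label": "positive"},
    ]


def build_prompt(text: str, shots: Sequence[Sample]) -> str:
    """Zero-shot prompt when `shots` is empty, few-shot prompt otherwise."""
    parts = ["Classify the sentiment as positive or negative."]
    for shot in shots:
        parts.append(f"Text: {shot['input']}\nLabel: {shot['label']}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


def toy_model(prompt: str) -> str:
    """Model provider stand-in: a keyword rule instead of a real LLM call."""
    query = prompt.rsplit("Text:", 1)[1].lower()  # only look at the final query
    return " Negative." if "hated" in query else " Positive."


def post_process(response: str) -> str:
    """Map the raw model response onto the label space."""
    return response.strip().strip(".").lower()


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Evaluation module stand-in: fraction of exact label matches."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def run_benchmark(model: Callable[[str], str], shots: Sequence[Sample] = ()) -> float:
    """Tie the pieces together: load, prompt, query, post-process, score."""
    data = load_dataset()
    preds = [post_process(model(build_prompt(s["input"], shots))) for s in data]
    return accuracy(preds, [s["label"] for s in data])


if __name__ == "__main__":
    demo_shot = [{"input": "Terrible service", "label": "negative"}]
    print("zero-shot accuracy:", run_benchmark(toy_model))
    print("few-shot accuracy:", run_benchmark(toy_model, demo_shot))
```

In this sketch, switching to another task would mean replacing only `load_dataset`, `build_prompt`, and `post_process`, while the model and metric stay untouched; this mirrors the separation of concerns the abstract describes, under the assumptions stated above.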

Paper Structure (22 sections, 2 figures, 1 table)

Introduction
LLMeBench
Model Provider module
Dataset module
Evaluation module
Benchmarking Asset module
Interaction
Features
Modularity
Generality
Prompts
Zero-shot prompts
Few-shot prompts
Caching
Dataset Auto-Download
...and 7 more sections

Figures (2)

Figure 1: The architecture of the LLMeBench framework. The dotted boxes represent the core implemented modules of the architecture. Customization for new tasks, datasets, and models can be done on Dataset, Model Provider, Evaluation, and Asset modules.
Figure 2: Summary and examples of the 53 datasets, 31 tasks, 4 model providers, 5 tested models and metrics currently implemented and validated in LLMeBench.
