UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs

Chaoqun He; Renjie Luo; Shengding Hu; Yuanqian Zhao; Jie Zhou; Hanghao Wu; Jiajie Zhang; Xu Han; Zhiyuan Liu; Maosong Sun

TL;DR

The paper addresses the fragmentation of LLM evaluation pipelines by proposing UltraEval, a lightweight, modular framework that decouples models, data, and metrics and exposes models as HTTP-enabled services for flexible deployment. The approach combines data standardization, prompt templating, and a scalable inference engine (vLLM with Gunicorn) to support 59 benchmarks and diverse task types. It provides post-processing and a suite of automatic and human evaluation methods, including GPT-4 as a discriminator, to enable robust scoring across tasks. The results demonstrate reproducible benchmarking across popular models like Llama2 and Mistral, with alignment to published results, and the framework offers practical benefits for rapid, extensible evaluation, with plans to expand multimodal and RAG capabilities.

Abstract

Evaluation is pivotal for refining Large Language Models (LLMs), pinpointing their capabilities, and guiding enhancements. The rapid development of LLMs calls for a lightweight and easy-to-use framework for swift evaluation deployment. However, given the many implementation details involved, developing a comprehensive evaluation platform is not easy. Existing platforms are often complex and poorly modularized, hindering seamless incorporation into research workflows. This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency. We identify and reimplement three core components of model evaluation (models, data, and metrics). The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow. Additionally, UltraEval supports diverse models through a unified HTTP service and provides inference acceleration. UltraEval is now publicly available to researchers.
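
To make the idea of composability concrete, the following is a minimal sketch of an evaluation loop in which the model, the data, the prompt template, and the metric are independent, swappable pieces. It is not UltraEval's actual API: every name here (`Example`, `render_prompt`, `exact_match`, `evaluate`, `toy_model`) is hypothetical, and `toy_model` is a local stand-in for what would, in a deployed setting, be a request to a model served over HTTP.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    """One standardized data item (hypothetical schema, for illustration only)."""
    question: str
    answer: str


def render_prompt(template: str, example: Example) -> str:
    # Prompt templating: the template is independent of the dataset and the model.
    return template.format(question=example.question)


def exact_match(prediction: str, reference: str) -> float:
    # A simple automatic metric: case-insensitive exact match.
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(
    model: Callable[[str], str],
    examples: Iterable[Example],
    template: str,
    metric: Callable[[str, str], float],
    postprocess: Callable[[str], str] = str.strip,
) -> float:
    # The loop only knows the interfaces, so any model, data, template,
    # post-processing step, or metric can be swapped without touching it.
    scores = []
    for example in examples:
        raw_output = model(render_prompt(template, example))
        scores.append(metric(postprocess(raw_output), example.answer))
    return sum(scores) / len(scores) if scores else 0.0


def toy_model(prompt: str) -> str:
    # Assumption: stands in for an HTTP call to a served model.
    return " 4\n" if "2 + 2" in prompt else "unknown"


examples = [
    Example("What is 2 + 2?", "4"),
    Example("What is the capital of France?", "Paris"),
]
score = evaluate(toy_model, examples, "Q: {question}\nA:", exact_match)
print(score)
```

Because `model` is just a callable from prompt to text, replacing `toy_model` with a function that posts the prompt to a model service leaves the data, template, and metric code unchanged, which is the kind of decoupling the abstract describes.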