clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

Anne Beyer, Kranti Chalamalasetti, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

TL;DR

Clembench offers a dynamic, self-play-based benchmark for evaluating LLMs as multi-action agents. It combines game templates, parsing rules, and a GameMaster to produce interactive episodes that are scored on formatting adherence and gameplay quality. The framework is robust to data contamination, scales to many models (including open-weight ones), and correlates measurably with interactive benchmarks, while exposing clear gaps between human expert performance and current models. A multilingual case study shows that the framework can probe instruction following across languages, supporting broader evaluation of cross-lingual capabilities. Overall, clembench serves as a practical, extensible tool for model selection and for exploring future directions such as RL-based learning environments and development workflows for goal-directed agents.

Abstract

It has been established in recent work that Large Language Models (LLMs) can be prompted to "self-play" conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding abilities), where the resulting interactive game play can be automatically scored. In this paper, we take one of the proposed frameworks for setting up such game-play environments, and further test its usefulness as an evaluation instrument, along a number of dimensions: We show that it can easily keep up with new developments while avoiding data contamination, we show that the tests implemented within it are not yet saturated (human performance is substantially higher than that of even the best models), and we show that it lends itself to investigating additional questions, such as the impact of the prompting language on performance. We believe that the approach forms a good basis for making decisions on model choice for building applied interactive systems, and perhaps ultimately setting up a closed-loop development environment of system and simulated evaluator.

Paper Structure (11 sections, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Top: Bump chart showing ranking differences between clembench (left) and Chatbot Arena (2024-05-16; right); Bottom: Ranking differences between clembench (left) and HELM (v1.3.0; right)