Table of Contents
Fetching ...

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Ilya Gusev

TL;DR

This work introduces a benchmark for evaluating the role-playing capabilities of language models, and provides a foundation for a robust and dynamic evaluation of different model capabilities in interactive scenarios.

Abstract

We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages different language models to simulate users in dynamic, multi-turn conversations and assess the resulting dialogues. Our methodology involves three main components: a player model that adopts a specific character role, an interrogator model that simulates user behavior in a specific situation, and a judge model ensemble that evaluates conversation quality with 3 metrics: character consistency, entertainment value, and language fluency. We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations with 8 characters and 8 situations. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of different model capabilities in interactive scenarios.

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

TL;DR

This work introduces a benchmark for evaluating the role-playing capabilities of language models, and provides a foundation for a robust and dynamic evaluation of different model capabilities in interactive scenarios.

Abstract

We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages different language models to simulate users in dynamic, multi-turn conversations and assess the resulting dialogues. Our methodology involves three main components: a player model that adopts a specific character role, an interrogator model that simulates user behavior in a specific situation, and a judge model ensemble that evaluates conversation quality with 3 metrics: character consistency, entertainment value, and language fluency. We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations with 8 characters and 8 situations. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of different model capabilities in interactive scenarios.
Paper Structure (23 sections, 3 figures, 9 tables)

This paper contains 23 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: This diagram illustrates the flow of interactions in the proposed benchmark. There are three main components with different language models: a player, an interrogator, and a judge ensemble. The player assumes some character role, the interrogator acts as a user in a specific situation, and the judges evaluate final conversations.
  • Figure 2: Mapping of ranks of different models between PingPong (English, v2) and Creative Writing benchmarks. Colors signify different model families.
  • Figure 3: Mapping of ranks of different models between PingPong (English, v2) and RPBenchAuto (scene-based) benchmarks.