TextArena

Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan

TL;DR

TextArena addresses the need for evaluating emergent social and strategic skills in large language model agents by providing a competitive, text-based game platform. It introduces a Gym-like, extensible framework that supports 57+ environments across single-, two-, and multi-player settings, with online evaluation via a TrueSkill leaderboard. Soft-skill profiling and model-vs-model as well as model-vs-human evaluation offer nuanced, relative performance insights beyond static benchmarks. By enabling self-play-driven data generation and community-driven expansion, TextArena aims to catalyze scalable evaluation and training of agentic reasoning in LLMs.
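
To make "Gym-like" concrete, the sketch below shows the general shape of a turn-based reset / observe / act / step loop over plain-text observations and actions. It is an illustration only: `ToyNimEnv`, its method names, and the two agents are hypothetical and are not taken from the TextArena codebase, whose actual interface is documented in the repository.

```python
from typing import Callable, Dict, Tuple

Agent = Callable[[str], str]  # maps a text observation to a text action


class ToyNimEnv:
    """Hypothetical two-player text game (NOT the TextArena API).

    Players alternate removing stones; whoever takes the last stone wins.
    """

    def __init__(self, stones: int = 7, max_take: int = 3) -> None:
        self.start_stones = stones
        self.max_take = max_take
        self.reset()

    def reset(self) -> None:
        self.stones = self.start_stones
        self.current_player = 0
        self.rewards: Dict[int, int] = {0: 0, 1: 0}

    def get_observation(self) -> Tuple[int, str]:
        """Return the id of the player to move and a text prompt for them."""
        text = (f"There are {self.stones} stones. "
                f"Reply with a number from 1 to {self.max_take}.")
        return self.current_player, text

    def step(self, action: str) -> bool:
        """Apply a text action; return True once the game is over."""
        opponent = 1 - self.current_player
        try:
            take = int(action.strip())
        except ValueError:
            take = 0  # unparseable text is treated as an invalid move
        if not 1 <= take <= min(self.max_take, self.stones):
            # An invalid move forfeits the game (an assumption of this toy).
            self.rewards = {self.current_player: -1, opponent: 1}
            return True
        self.stones -= take
        if self.stones == 0:
            self.rewards = {self.current_player: 1, opponent: -1}
            return True
        self.current_player = opponent
        return False


def greedy_agent(observation: str) -> str:
    """Take as many stones as allowed (reads the count from the prompt)."""
    stones = int(observation.split()[2])
    return str(min(3, stones))


def always_one_agent(observation: str) -> str:
    """Always take exactly one stone."""
    return "1"


def play(env: ToyNimEnv, agents: Dict[int, Agent]) -> Dict[int, int]:
    """Run one episode and return the final per-player rewards."""
    env.reset()
    done = False
    while not done:
        player_id, observation = env.get_observation()
        done = env.step(agents[player_id](observation))
    return env.rewards


if __name__ == "__main__":
    print(play(ToyNimEnv(), {0: greedy_agent, 1: always_one_agent}))
```

In this shape an agent is just a callable from text to text, so a scripted policy, a human at a prompt, or an LLM wrapper can be swapped in without changing the loop.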

Abstract

TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community, and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of the environments, games, leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.
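
For readers unfamiliar with TrueSkill, the sketch below shows the standard two-player, no-draw rating update: each player holds a Gaussian skill belief (mean `mu`, uncertainty `sigma`), and a game result shifts both means while shrinking both uncertainties. The starting values (`mu = 25`, `sigma = 25/3`, `beta = 25/6`) are the conventional TrueSkill defaults and are an assumption here; the abstract does not state the leaderboard's parameters, its handling of draws, or how multi-player games are rated. The `Rating` class and `update_1v1` function are illustrative names, not part of TextArena.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rating:
    mu: float = 25.0           # estimated skill (conventional default)
    sigma: float = 25.0 / 3.0  # uncertainty about that estimate


BETA = 25.0 / 6.0  # per-game performance noise (conventional default)


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update_1v1(winner: Rating, loser: Rating,
               beta: float = BETA) -> Tuple[Rating, Rating]:
    """Two-player TrueSkill update with no draws and no skill drift.

    Simplifications (assumptions of this sketch): draw margin of zero,
    no dynamics term, and no guard against _cdf underflow for extreme
    rating gaps.
    """
    c = math.sqrt(2.0 * beta ** 2 + winner.sigma ** 2 + loser.sigma ** 2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)  # size of the mean shift; larger for upsets
    w = v * (v + t)        # fraction of variance removed, between 0 and 1

    def shrink(sigma: float) -> float:
        return sigma * math.sqrt(1.0 - (sigma ** 2 / c ** 2) * w)

    new_winner = Rating(winner.mu + winner.sigma ** 2 / c * v,
                        shrink(winner.sigma))
    new_loser = Rating(loser.mu - loser.sigma ** 2 / c * v,
                       shrink(loser.sigma))
    return new_winner, new_loser


if __name__ == "__main__":
    a, b = update_1v1(Rating(), Rating())
    print(a, b)
```

Because the update depends on both players' uncertainties, a new model's rating moves quickly in its first few games and settles as evidence accumulates, which suits a leaderboard that accepts submissions continuously.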

Paper Structure

This paper contains 11 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: TextArena Soft-skill comparison. Frontier models and Humanity are compared across ten key skills. Each skill is normalised separately for presentation; see the leaderboard for full data.
  • Figure 2: Preliminary model rankings for a subset of models and games. Game-play results are influenced both by the models' ability to play the games and by their ability to understand the rules and format. For example, some reasoning models occasionally reveal their cards or roles during game-play.
  • Figure 3: Images of some (rendered) TextArena environments.