KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Jiajun Shi, Jian Yang, Jiaheng Liu, Xingyuan Bu, Jiangjie Chen, Junting Zhou, Kaijing Ma, Zhoufutu Wen, Bingli Wang, Yancheng He, Liang Song, Hualei Zhu, Shilong Li, Xingjian Wang, Wei Zhang, Ruibin Yuan, Yifan Yao, Wenjun Yang, Yunli Wang, Siyuan Fang, Siyu Yuan, Qianyu He, Xiangru Tang, Yingshui Tan, Wangchunshu Zhou, Zhaoxiang Zhang, Zhoujun Li, Wenhao Huang, Ge Zhang

TL;DR

KORGym introduces a dynamic, game-based benchmark to evaluate intrinsic LLM reasoning in a knowledge-orthogonal setting, addressing the limits of domain-specific benchmarks. It comprises over fifty textual and multimodal games across six reasoning dimensions and supports reinforcement-learning-enabled, multi-turn evaluation. Through a large-scale study of 19 LLMs and 8 VLMs, the paper reveals consistent strength–weakness profiles within model series, the superior performance of closed-source models, and the influence of modality, reasoning strategies, RL, and response length on performance. The authors propose a robust evaluation framework, including a per-game normalization and the Capability Dimension Aggregated Mean, and demonstrate that appropriate RL and explicit reasoning paradigms can enhance cross-dimension reasoning, offering a practical platform for advancing interactive LLM reasoning research.

Abstract

Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.

Paper Structure

This paper contains 36 sections, 4 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Comparison Between Traditional Single-epoch Game Benchmark and KORGym.
  • Figure 2: Framework of the KORGym system. Our system architecture primarily consists of four modules: the Inference Module, Game Interaction Module, Evaluation Module, and Communication Module. The initialization parameters include: Game Name, Model Information, Seed, Deployment Port Number, and Output Directory.
  • Figure 3: Overview of the KORGym tasks. Our KORGym supports over 50 novel games, enabling precise and efficient evaluation of large language models (LLMs) across six distinct capability dimensions.
  • Figure 4: Capability Dimension Illustration. Figure (a) showcases the performance of the top-performing models on KORGym. Figure (b) showcases the impact of model scale and architecture on reasoning capabilities.
  • Figure 5: Performance Comparison Between Textual and Multimodal Game Versions. This figure illustrates a given model’s performance on both the textual and multimodal versions of the same game. Different games are represented by distinct bar colors, and bar shading differentiates text (unshaded) from visual (shaded) versions. Solid and dashed lines correspond to the average textual and visual scores, respectively.
  • ...and 3 more figures