LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

Jiayi Gui, Yiming Liu, Jiale Cheng, Xiaotao Gu, Xiao Liu, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang

TL;DR

LogicGame introduces a rule-based reasoning benchmark for large language models, emphasizing execution and planning under strictly defined rules with deterministic intermediate steps. The dataset is built via a four-phase process, includes bilingual zh/en versions, and uses a structured JSON output with A-Acc, P-Acc, and AP-Acc scoring to automatically verify reasoning traces. Experiments across 14 diverse LLMs show substantial gaps in rule-based reasoning, with best performances around the mid-20s to mid-50s in AP-Acc depending on language and task, and notable differences between execution and planning. The work highlights the importance of evaluating both outcomes and reasoning processes, demonstrates the challenges models face on Reversi-like tasks, and calls for advances in coherent, rule-constrained reasoning in future systems.

Abstract

Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
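
To make the idea of scoring both final outcomes and intermediate steps concrete, here is a minimal sketch of how a structured JSON response might be checked against a deterministic reference. It is an illustration under stated assumptions, not the authors' evaluation code: the field names `answer` and `process`, the helper names `score_response` and `aggregate`, and the use of exact matching for both fields are all hypothetical, and this excerpt does not specify the benchmark's actual matching criteria. The metric names A-Acc, P-Acc, and AP-Acc come from the TL;DR; treating AP-Acc as "both answer and process correct" is an assumption.

```python
import json
from typing import Dict, List


def score_response(raw_response: str, ref_answer: str, ref_process: List[str]) -> Dict[str, int]:
    """Score one model response against a deterministic reference.

    Assumptions (not taken from the paper): the response is a JSON object
    with hypothetical keys "answer" and "process", and both are compared
    to the reference by exact match.
    """
    zero = {"a_acc": 0, "p_acc": 0, "ap_acc": 0}
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return zero  # unparseable output gets no credit
    if not isinstance(parsed, dict):
        return zero

    answer_ok = parsed.get("answer") == ref_answer
    process_ok = parsed.get("process") == ref_process
    return {
        "a_acc": int(answer_ok),
        "p_acc": int(process_ok),
        # Assumed combination: credit only when answer AND process are correct.
        "ap_acc": int(answer_ok and process_ok),
    }


def aggregate(scores: List[Dict[str, int]]) -> Dict[str, float]:
    """Average per-sample scores into percentages."""
    n = len(scores)
    return {key: 100.0 * sum(s[key] for s in scores) / n
            for key in ("a_acc", "p_acc", "ap_acc")}


if __name__ == "__main__":
    # Made-up reference and responses, for illustration only.
    reference = ("3", ["flip A", "flip B"])
    responses = [
        '{"answer": "3", "process": ["flip A", "flip B"]}',  # both correct
        '{"answer": "3", "process": ["flip A"]}',            # answer only
        "I think the answer is 3",                            # not JSON
    ]
    per_sample = [score_response(r, *reference) for r in responses]
    print(aggregate(per_sample))
```

Because the intermediate steps are deterministic, a check of this kind needs no human judgment. A response with a correct final answer but an incorrect process is counted under the answer metric only, which is what lets the two failure modes be told apart.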

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Evaluation results and demonstrations of LogicGame. (Bottom) Case studies on two examples, one from the execution category and one from the planning category. (Top) Performance of various models across the execution and planning categories. Performance is the arithmetic mean over LogicGame's Chinese and English versions. Most models struggle on LogicGame, scoring less than 12% in both categories. The two top-performing models, highlighted with pink stars, stand out.
  • Figure 2: Illustration of the taxonomy and evaluation protocol in LogicGame. The taxonomy illustration highlights categories involving mathematics in purple. The JSON format constraint used in evaluation is omitted due to space limitations; see Appendix \ref{sec:json prompt}.
  • Figure 3: Performance comparison of 14 models on LogicGame measured by AP-Acc for both Chinese (zh) and English (en) versions.
  • Figure 4: Few-shot differences on the execution and planning categories of LogicGame's zh version. "shot_diff_1_0" is the difference in P-Acc between the 1-shot and 0-shot settings, calculated as the 1-shot result minus the 0-shot result; "shot_diff_2_0" is the corresponding difference between the 2-shot and 0-shot settings.
  • Figure 5: Few-shot differences across difficulty levels of LogicGame's zh version, with the same shot-difference settings as Figure \ref{fig:category_diff_fewshot}.
  • ...and 5 more figures
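
The few-shot difference described in the Figure 4 caption is a simple subtraction, which the following sketch spells out. The function name `shot_diff` and the input layout (a mapping from shot count to P-Acc) are hypothetical, and the numbers in the usage example are made up for illustration; they are not results from the paper.

```python
from typing import Dict


def shot_diff(p_acc_by_shots: Dict[int, float]) -> Dict[str, float]:
    """Return k-shot minus 0-shot P-Acc for every k > 0.

    Keys follow the caption's naming, e.g. "shot_diff_1_0" is the
    1-shot result minus the 0-shot result.
    """
    baseline = p_acc_by_shots[0]  # 0-shot P-Acc
    return {
        f"shot_diff_{k}_0": p_acc_by_shots[k] - baseline
        for k in sorted(p_acc_by_shots)
        if k != 0
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    example = {0: 10.0, 1: 12.5, 2: 9.0}
    print(shot_diff(example))
```

A positive value means the added demonstrations improved P-Acc over the 0-shot setting, and a negative value means they reduced it.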