AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models

Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, Junyang Lin

TL;DR

AutoLogi tackles the vulnerability and limited discriminability of traditional reasoning benchmarks by automatically generating open-ended logic puzzles evaluated with program-based verifiers. The method assembles a bilingual English-Chinese benchmark through a three-stage pipeline of information extraction, verifier generation, and data augmentation, all under a cross-validated framework. It demonstrates superior discrimination across eight modern LLMs and enables effective data-driven training via rejection sampling for SFT and DPO, improving performance on independent reasoning benchmarks and showcasing notable cross-domain generalization. The work also provides insights into language-agnostic reasoning and the scaling behavior of optimization methods, while acknowledging limitations due to reliance on LLMs and imperfect verifiers.

Abstract

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice dataset. Beyond benchmark creation, this synthesis method can generate high-quality training data by incorporating program verifiers into the rejection sampling process, enabling systematic enhancement of LLMs' reasoning capabilities across diverse datasets.
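As a rough illustration of what program-based verification and verifier-backed rejection sampling could look like, the sketch below checks candidate answers to a small seating puzzle. The puzzle, the constraint list, and the names `format_ok`, `verify`, and `rejection_sample` are hypothetical and chosen only for this example; they are not taken from the paper's released code.

```python
from typing import Callable, Dict, List

# Hypothetical puzzle: seat three people in three distinct seats numbered 1-3.
PEOPLE = ["Alice", "Bob", "Carol"]


def format_ok(answer: object) -> bool:
    """Format check: a dict mapping every person to a distinct seat in 1..3."""
    if not isinstance(answer, dict) or set(answer) != set(PEOPLE):
        return False
    seats = list(answer.values())
    return all(isinstance(s, int) for s in seats) and sorted(seats) == [1, 2, 3]


# Each constraint is a predicate over a well-formed arrangement (assumed rules).
CONSTRAINTS: List[Callable[[Dict[str, int]], bool]] = [
    lambda a: a["Alice"] < a["Bob"],  # Alice sits to the left of Bob
    lambda a: a["Carol"] != 1,        # Carol is not in seat 1
]


def verify(answer: object) -> bool:
    """Accept an answer only if it is well formed and meets every constraint."""
    return format_ok(answer) and all(check(answer) for check in CONSTRAINTS)


def rejection_sample(candidates: List[object]) -> List[object]:
    """Keep only the candidate answers that the verifier accepts."""
    return [c for c in candidates if verify(c)]


if __name__ == "__main__":
    sampled = [
        {"Alice": 1, "Bob": 2, "Carol": 3},  # satisfies both constraints
        {"Alice": 2, "Bob": 1, "Carol": 3},  # Alice is not left of Bob
        {"Alice": 2, "Bob": 3, "Carol": 1},  # Carol is in seat 1
        {"Alice": 1, "Bob": 2},              # malformed: Carol is missing
        {"Alice": 1, "Bob": 3, "Carol": 2},  # a second valid arrangement
    ]
    print(rejection_sample(sampled))
```

Because the puzzle is open-ended, more than one arrangement can pass the verifier, so an answer cannot be matched by picking among a handful of fixed options. Adding or removing entries in `CONSTRAINTS` shrinks or enlarges the set of accepted answers, which is one way to picture the controllable difficulty the abstract mentions.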

Paper Structure

This paper contains 44 sections, 25 figures, 10 tables.

Figures (25)

  • Figure 1: Comparison of evaluation processes between multiple-choice questions and our method. While multiple-choice questions may allow underperforming models to guess the correct answer, our method generates open-ended questions and uses a verification function to validate the generated solution, providing a more accurate reflection of model performance.
  • Figure 2: An overview of our method. The process consists of three stages: Stage 1 formulates logic puzzles by extracting background information and constraints from a source corpus. Stage 2 uses large language models (LLMs) to generate verifiers, which are programs that check puzzle solutions and ensure correct formatting. Stage 3 augments the puzzles by adding or removing constraints to create varying difficulty levels. All three stages leverage powerful LLMs, such as GPT-4, for generation.
  • Figure 3: The number of questions in the Chinese subset of AutoLogi before and after data augmentation, and the accuracy of eight models across different numbers of constraints and solution space proportions. The corresponding figure for the English subset can be found in the appendix.
  • Figure 4: The precision and recall of evaluations using our verification function (Program-based Verifier) and GPT-4 as the evaluator (LLM Judger). True/False indicates the ground truth correctness of the answer. Positive/Negative is the output label predicted by Program-based Verifier or LLM Judger.
  • Figure 5: An example of the LLM Judger making mistakes in verifying model responses.
  • ...and 20 more figures