GLoRE: Evaluating Logical Reasoning of Large Language Models

Hanmeng Liu, Zhiyang Teng, Ruoxi Ning, Yiran Ding, Xiulai Li, Xiaozhang Liu, Yue Zhang

TL;DR

GLoRE introduces a living, unified benchmark for evaluating the logical reasoning capabilities of instruction-tuned LLMs in zero-shot and few-shot settings. It aggregates 12 datasets (72,848 instances) spanning machine reading comprehension (MRC), natural language inference (NLI), and true-or-false (TF, Yes/No) tasks. The study benchmarks a wide range of models, from RoBERTa-base and open-source LLMs to ChatGPT, GPT-4, o1 mini, and QwQ-32B, reporting accuracy alongside human baselines. Reasoning-enhanced models such as QwQ-32B achieve state-of-the-art results on several tasks. Specialized models show strong gains on MRC and TF tasks, but performance varies substantially across datasets and under distribution shift, which points to robustness gaps in generalization. GLoRE is positioned as a dynamic platform that will continue to incorporate new data and models to track progress in logical reasoning across commercial and open-source ecosystems.

Abstract

Large language models (LLMs) have shown significant general language understanding abilities. However, there have been few attempts to assess the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a General Logical Reasoning Evaluation platform that not only consolidates diverse datasets but also standardizes them into a unified format suitable for evaluating large language models across zero-shot and few-shot scenarios. Our experimental results show that, compared to the performance of humans and supervised fine-tuning models, the logical reasoning capabilities of large reasoning models, such as OpenAI's o1 mini, DeepSeek R1, and QwQ-32B, have seen remarkable improvements, with QwQ-32B achieving the highest benchmark performance to date. GLoRE is designed as a living project that continuously integrates new datasets and models, facilitating robust and comparative assessments of model performance in both commercial and Hugging Face communities.

Paper Structure

This paper contains 11 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Instruction and question format for logical reading comprehension tasks.