Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning

Benjamin Grando Moreira

TL;DR

This study examines whether fifteen large language models (LLMs) and a cohort of eighty humans can perform logical and abstract reasoning on eight bespoke tasks. It adopts a qualitative assessment framework, focusing on answer correctness and the validity of reasoning rather than standard metrics. Results show that while some LLMs solve straightforward problems, many struggle with deeper inference, pattern discovery, and language-switch rules, underscoring gaps in genuine reasoning compared to humans. The work highlights the need for improved prompting and benchmarking to better reveal reasoning capabilities in AI systems and informs future directions for evaluation and design of reasoning-centric AI.

Abstract

Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it goes beyond mere linguistic task performance. It involves determining whether these models truly understand information, perform inferences, and draw conclusions in a logical and valid way. This study compares the logical and abstract reasoning skills of several LLMs (including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabiá) using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.

Paper Structure

This paper contains 9 sections and 3 tables.