Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

Seungpil Lee; Woochang Sim; Donghyeon Shin; Wongyu Seo; Jiwon Park; Seokki Lee; Sanha Hwang; Sejin Kim; Sundong Kim

TL;DR

The main contribution of this article lies in introducing the LoTH perspective, which provides a method for evaluating the reasoning process that conventional results-oriented approaches fail to capture, thereby offering new insights into the development of human-level reasoning in artificial intelligence systems.

Abstract

The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been predominantly results-centric, making it challenging to assess the inference process comprehensively. We introduce a novel approach using the Abstraction and Reasoning Corpus (ARC) benchmark to evaluate the inference and contextual understanding abilities of LLMs in a process-centric manner, focusing on three key components from the Language of Thought Hypothesis (LoTH): Logical Coherence, Compositionality, and Productivity. Our carefully designed experiments reveal that while LLMs demonstrate some inference capabilities, they still significantly lag behind human-level reasoning in these three aspects. The main contribution of this paper lies in introducing the LoTH perspective, which provides a method for evaluating the reasoning process that conventional results-oriented approaches fail to capture, thereby offering new insights into the development of human-level reasoning in artificial intelligence systems.

Paper Structure (55 sections, 2 equations, 17 figures, 7 tables, 1 algorithm)

Introduction
Preliminaries
Limitation on Assessing Reasoning Ability of LLMs
Advantages of using ARC as Reasoning Benchmark
Core Properties of ARC
Flexibility in benchmark adaptation
Evaluating the Inferential Capabilities of LLMs Using the ARC Benchmark
Capability of LLMs 1: Logical Coherence
Motivation
Comparison Across Prompting Techniques
Inferential Coherence of LLMs
Case Study: Semantic Coherence of LLMs
Conclusion
Capability of LLMs 2: Compositionality
Motivation
...and 40 more sections

Figures (17)

Figure 1: Three different ARC tasks. Each task involves demonstration examples of input and output grids that exemplify the required transformation. Solvers must generate the correct output grid for the test example's input grid by applying the same transformation. ARC is a straightforward benchmark that can be solved using only four types of prior knowledge: objectness, goal-directedness, arithmetic, and geometric topology. Despite the small amount of prior knowledge required to solve the tasks, it presents a high level of reasoning difficulty. These characteristics enable ARC to serve as a benchmark that fairly measures reasoning abilities.
Figure 2: Three concepts of the Language of Thought Hypothesis (LoTH).
Figure 3: Three prompting techniques in the experiment about logical coherence: (a) CoT, (b) LtM, and (c) ToT.
Figure 4: Three types of prompts are shown on the left. Although all prompts are described as a 2D array of grids, we visualized them on the right for clarity. By default, all three techniques use prompts with two main components: a sample task and a target task. However, LtM and ToT use a different combination of the target task and its decomposition command. This difference arises because CoT strictly follows the given sub-task, while LtM and ToT decompose the task on their own.
Figure 5: Grey blocks illustrate prompt sets delivered to the LLM, including the sample task, target task, and LLM's prior responses, as shown in Fig. 4. Green blocks denote the final answer. CoT relies on a single grey block, indicating that the LLM strictly follows the provided sub-tasks. Conversely, LtM and ToT prompt the LLM to generate and address sub-tasks sequentially, represented by decomposed results (red) and intermediate responses (blue). ToT further distinguishes itself from LtM by evaluating multiple suggestions for sub-task handling and selecting the most effective one through a voting mechanism.
...and 12 more figures
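
The task format described in the Figure 1 caption can be made concrete with a small sketch. The block below is illustrative only and is not taken from the paper: the toy task, the candidate rule `flip_horizontal`, and the helpers `fits_demonstrations` and `solve` are hypothetical names introduced here. The `train`/`test` and `input`/`output` keys follow the layout of the public ARC JSON files, with each grid stored as a 2D array of integer color codes.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each cell holds an integer color code

# Hypothetical toy task (not from the ARC dataset): the hidden rule is
# "mirror every row left to right".
task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [6, 0]]}],
}


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate transformation: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_demonstrations(rule: Callable[[Grid], Grid], task: Dict[str, list]) -> bool:
    """Return True if the rule reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


def solve(rule: Callable[[Grid], Grid], task: Dict[str, list]) -> List[Grid]:
    """Apply a rule to the test inputs, but only if it explains the demonstrations."""
    if not fits_demonstrations(rule, task):
        raise ValueError("rule does not explain the demonstration examples")
    return [rule(pair["input"]) for pair in task["test"]]


prediction = solve(flip_horizontal, task)
print(prediction)
```

A solver is judged on whether the predicted test grid matches the expected grid cell for cell, so a rule that only partly explains the demonstrations is rejected here rather than applied.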
