Intelligence Analysis of Language Models

Liane Galanti, Ethan Baron

TL;DR

The ARC benchmark probes abstract and visual reasoning grounded in Core Knowledge priors, allowing human and machine capabilities to be compared. The study evaluates open-source LLMs (e.g., LLaMA, Phind, Mixtral) on ARC tasks using textual encodings and two prompting regimes: Zero-shot and Chain-of-Thought (CoT). Results show extremely limited success, with at most $2/50$ tasks solved, and CoT occasionally degrades performance or yields flawed reasoning, underscoring the gap between current LLMs and Artificial General Intelligence. The work highlights the need for novel methods and prompts beyond standard LLMs, and provides a reproducible framework with a public GitHub repository.

Abstract

In this project, we test the effectiveness of Large Language Models (LLMs) on the Abstraction and Reasoning Corpus (ARC) dataset. This dataset serves as a representative benchmark for testing abstract reasoning abilities, requiring a fundamental understanding of key concepts such as object identification, basic counting, and elementary geometric principles. Tasks from this dataset are converted into a prompt-based format for evaluation. Initially, we assess the models' potential through a Zero-shot approach. Subsequently, we investigate the application of the Chain-of-Thought (CoT) technique, aiming to determine its role in improving model performance. Our results suggest that, despite the high expectations placed on contemporary LLMs, these models still struggle in non-linguistic domains, even when dealing with simpler subsets of the ARC dataset. Our study is the first to concentrate on the capabilities of open-source models in this context. The code, dataset, and prompts supporting this project's findings can be found in our GitHub repository, accessible at: https://github.com/Lianga2000/LLMsOnARC.
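To make the "prompt-based format" concrete, the following is a minimal sketch of how an ARC task could be serialized to text and wrapped in a Zero-shot or CoT prompt. It assumes the public ARC JSON layout (a `train` list and a `test` list of `input`/`output` grids of integers 0-9). The helper names `grid_to_text`, `build_prompt`, and `is_solved`, the space-separated encoding, and the prompt wording are all hypothetical illustrations, not the encoding or prompts used in the repository.

```python
from typing import Dict, List

Grid = List[List[int]]


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid as one line per row, cells separated by spaces (assumed encoding)."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_prompt(task: Dict, use_cot: bool = False) -> str:
    """Build a text prompt from the demonstration pairs and the first test input."""
    parts = []
    for i, pair in enumerate(task["train"], start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
    if use_cot:
        # Hypothetical CoT instruction: ask for reasoning before the answer.
        parts.append("Let's think step by step, then give the test output grid.")
    else:
        # Zero-shot: ask for the answer directly.
        parts.append("Test output:")
    return "\n\n".join(parts)


def is_solved(predicted: Grid, expected: Grid) -> bool:
    """A task counts as solved only if every cell of the predicted grid matches."""
    return predicted == expected


# Toy task (not from the dataset): the rule swaps the two columns.
toy_task = {
    "train": [{"input": [[0, 1], [2, 3]], "output": [[1, 0], [3, 2]]}],
    "test": [{"input": [[4, 5], [6, 7]], "output": [[5, 4], [7, 6]]}],
}

zero_shot_prompt = build_prompt(toy_task)
cot_prompt = build_prompt(toy_task, use_cot=True)
```

Exact-match scoring of this kind is strict: a single wrong cell fails the task, which is one reason solved-task counts on ARC tend to be low.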

Paper Structure (16 sections, 2 figures, 4 tables)