Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task

Junjie Wu, Mo Yu, Lemao Liu, Dit-Yan Yeung, Jie Zhou

TL;DR

The study investigates fluid intelligence in large language models (LLMs) using ARC, a benchmark for abstract inductive reasoning, and introduces the Abstraction and Reasoning on Atom Operation Corpus (ARAOC), which decomposes ARC tasks into six atomic operations. Experiments from several perspectives (modality comparisons, model-size analyses, fine-tuning with LoRA, input-format transformations, and modeling-era considerations) point to three bottlenecks: limited skill composition, unfamiliarity with abstract input formats, and a left-to-right autoregressive decoding bias that inhibits global reasoning. GPT-4o leads among the tested models, and fine-tuning on atomic operations improves specific tasks, but human-like fluid intelligence remains out of reach, especially for complex compositions and for tasks that require encoding global information. The findings suggest directions for benchmark design, representation learning, and architectural changes that better capture abstraction and generalization in LLMs, with implications for tasks requiring novel problem-solving beyond memorized knowledge.
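To make "atomic operations" and "skill composition" concrete, here is a minimal sketch in Python. The two operations (`mirror_horizontal`, `change_color`) and the `compose` helper are hypothetical illustrations chosen for this example. They are not the paper's definitions of the six ARAOC operations, and the convention that color 0 is the background is an assumption.

```python
from typing import Callable, List

Grid = List[List[int]]  # each cell holds a color index; 0 is assumed to be background


def mirror_horizontal(grid: Grid) -> Grid:
    """Flip the grid left-to-right (hypothetical atomic operation)."""
    return [list(reversed(row)) for row in grid]


def change_color(grid: Grid, old: int, new: int) -> Grid:
    """Recolor every cell equal to `old` with `new` (hypothetical atomic operation)."""
    return [[new if cell == old else cell for cell in row] for row in grid]


def compose(*ops: Callable[[Grid], Grid]) -> Callable[[Grid], Grid]:
    """Chain operations left to right into one transformation."""
    def run(grid: Grid) -> Grid:
        for op in ops:
            grid = op(grid)
        return grid
    return run


# A composed task: mirror the grid, then recolor color 1 as color 2.
task = compose(mirror_horizontal, lambda g: change_color(g, old=1, new=2))

example = [[1, 0, 0],
           [0, 3, 0]]
result = task(example)
```

Each operation on its own is a simple cell-wise or row-wise rule. A composed task requires the solver to infer both rules, and their order, from input/output pairs alone, which is the kind of difficulty the summary attributes to limited skill composition.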

Abstract

While LLMs have exhibited strong performance on various NLP tasks, it is noteworthy that most of these tasks rely on utilizing the vast amount of knowledge encoded in LLMs' parameters, rather than solving new problems without prior knowledge. In cognitive research, the latter ability is referred to as fluid intelligence, which is considered to be critical for assessing human intelligence. Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs' abilities. In this paper, we analyze the challenges LLMs face in demonstrating fluid intelligence through controlled experiments, using the most representative ARC task as an example. Our study reveals three major limitations in existing LLMs: limited ability for skill composition, unfamiliarity with abstract input formats, and the intrinsic deficiency of left-to-right decoding. Our data and code can be found at https://wujunjie1998.github.io/araoc-benchmark.github.io/.

Paper Structure

This paper contains 42 sections, 9 figures, and 22 tables.

Figures (9)

  • Figure 1: A saliency analysis example, where a darker color indicates higher saliency with respect to the boxed token.
  • Figure 2: The standard prompt used in this paper, which converts ARC/ARAOC tasks into matrix-format inputs. It is also the prompt for the textual input/textual output setting in Table \ref{tab:different format}.
  • Figure 3: The prompt for the visual input/visual output setting in Table \ref{tab:different format}.
  • Figure 4: The prompt for the visual+textual input/visual output setting in Table \ref{tab:different format}.
  • Figure 5: The prompt for the visual+textual input/textual output setting in Table \ref{tab:different format}.
  • ...and 4 more figures
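
The Figure 2 caption describes converting ARC/ARAOC tasks into matrix-format textual inputs. The sketch below shows one way such a conversion could look. The actual prompt wording is in the paper's Figure 2 and is not reproduced here, so the labels ("Example", "Input", "Output", "Test") and the helper names `grid_to_text` and `build_prompt` are assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]  # each cell holds an integer color index


def grid_to_text(grid: Grid) -> str:
    """Render a grid as a bracketed matrix, one row per line."""
    rows = [str(row) for row in grid]
    return "[" + ",\n ".join(rows) + "]"


def build_prompt(train_pairs: List[Tuple[Grid, Grid]], test_input: Grid) -> str:
    """Assemble demonstration pairs and a test input into one text prompt.

    The section labels are hypothetical, not the paper's wording.
    """
    parts = []
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(
            f"Example {i}\nInput:\n{grid_to_text(inp)}\nOutput:\n{grid_to_text(out)}"
        )
    # The prompt ends with an open "Output:" for the model to complete.
    parts.append(f"Test\nInput:\n{grid_to_text(test_input)}\nOutput:")
    return "\n\n".join(parts)


prompt = build_prompt(
    train_pairs=[([[0, 1], [2, 3]], [[1, 0], [3, 2]])],
    test_input=[[4, 5], [6, 7]],
)
```

Because the model must then emit the output matrix token by token, row by row, a serialization like this also makes the left-to-right decoding constraint visible: cells in early rows are generated before the model has written anything about later rows.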