From Reasoning to Generalization: Knowledge-Augmented LLMs for ARC Benchmark

Chao Lei, Nir Lipovetzky, Krista A. Ehinger, Yanchuan Chang

TL;DR

This work evaluates reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC), recasting ARC as a program-synthesis task and comparing nine solvers that vary in generation strategy and representation. It introduces KAAR, a knowledge-augmentation framework that encodes core ARC priors in a three-level ontology and augments LLM reasoning one level at a time, enabling stage-wise reasoning and improved generalization. Across all evaluated LLMs, KAAR outperforms its planning-aided backbone solver (RSPC), achieving around 5% absolute gains and up to 64.52% relative improvement while maintaining strong generalization. The study confirms that ARC remains challenging and highlights the potential of structured, dependency-aware knowledge priors to advance abstract reasoning and generalization in LLMs, with implications for transfer to related domains such as robotic task planning and visual reasoning.

Abstract

Recent reasoning-oriented LLMs have demonstrated strong performance on challenging tasks such as mathematics and science examinations. However, core cognitive faculties of human intelligence, such as abstract reasoning and generalization, remain underexplored. To address this, we evaluate recent reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) benchmark, which explicitly demands both faculties. We formulate ARC as a program synthesis task and propose nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most LLMs. To further improve performance, we introduce an ARC solver, Knowledge Augmentation for Abstract Reasoning (KAAR), which encodes core knowledge priors within an ontology that classifies priors into three hierarchical levels based on their dependencies. KAAR progressively expands LLM reasoning capacity by gradually augmenting priors at each level, and invokes RSPC to generate candidate solutions after each augmentation stage. This stage-wise reasoning reduces interference from irrelevant priors and improves LLM performance. Empirical results show that KAAR maintains strong generalization and consistently outperforms non-augmented RSPC across all evaluated LLMs, achieving around 5% absolute gains and up to 64.52% relative improvement. Despite these achievements, ARC remains a challenging benchmark for reasoning-oriented LLMs, highlighting future avenues of progress in LLMs.
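The stage-wise augmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `stagewise_solve`, `passes_training`, and `toy_solver`, the placeholder prior strings, and the rule of stopping at the first candidate that reproduces every training pair are all assumptions made for illustration. The `solver` argument stands in for an RSPC-style call to an LLM.

```python
from typing import Callable, Optional, Sequence

Grid = list[list[int]]
Pair = tuple[Grid, Grid]
Program = Callable[[Grid], Grid]
# Hypothetical stand-in for an RSPC-style solver: given a prompt context and
# the training pairs, it returns a candidate program or None.
Solver = Callable[[str, Sequence[Pair]], Optional[Program]]


def passes_training(program: Program, train_pairs: Sequence[Pair]) -> bool:
    """True if the program reproduces every training output exactly."""
    return all(program(x) == y for x, y in train_pairs)


def stagewise_solve(
    description: str,
    train_pairs: Sequence[Pair],
    prior_levels: Sequence[str],
    solver: Solver,
) -> Optional[tuple[int, Program]]:
    """Add priors one level at a time and call the solver after each stage.

    Assumption: the loop stops at the first candidate that passes all
    training pairs, so later levels are never added to the context.
    """
    context = description
    for level, priors in enumerate(prior_levels, start=1):
        context = f"{context}\n{priors}"
        candidate = solver(context, train_pairs)
        if candidate is not None and passes_training(candidate, train_pairs):
            return level, candidate
    return None


def shift_down(grid: Grid) -> Grid:
    """Toy transformation: move every row down by one, padding with zeros."""
    return [[0] * len(grid[0])] + grid[:-1]


def toy_solver(context: str, train_pairs: Sequence[Pair]) -> Optional[Program]:
    """Toy solver that only 'finds' the program once a keyword is in context."""
    return shift_down if "translation" in context else None


train = [([[1, 1], [0, 0]], [[0, 0], [1, 1]])]
levels = [
    "level-1 priors (placeholder): objects",
    "level-2 priors (placeholder): translation",
    "level-3 priors (placeholder): goals",
]
result = stagewise_solve("Infer the transformation.", train, levels, toy_solver)
```

In this toy run the solver fails with only the first level in context and succeeds once the second level is added, so the third level is never shown to it. That mirrors the stated motivation of reducing interference from irrelevant priors.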

Paper Structure

This paper contains 21 sections, 23 figures, 8 tables, 1 algorithm.

Figures (23)

  • Figure 1: An ARC problem example (25ff71a9) with image visualizations (a), including three input-output pairs in the training instances, and one input image in the test instance, along with their corresponding 2D matrix representations (b). The ground-truth test output is enclosed in a red box.
  • Figure 2: An illustration of the three ARC solution generation approaches, (1) direct generation, (2) repeated sampling, and (3) refinement, with the GPT-o3-mini input and response fragments (a–c) for solving task 25ff71a9 (Figure 1). For each approach, when the solution $s$ is code, $s := c$, a plan $p$ is either generated from the problem description $Q$ to guide code generation (planning-aided) or omitted (standalone). Otherwise, when $s := p$, the plan $p$ serves as the final solution instead.
  • Figure 3: An example of goal-directedness prior augmentation in KAAR, with input and response fragments from GPT-o3-mini.
  • Figure 4: Augmentation process in KAAR (block (b)) and the corresponding knowledge augmentation fragments (blocks (c-e)) for ARC problem 62ab2642 (block (a)).
  • Figure 5: Asymmetric relative coverage matrices for RSPC (a) and KAAR (b), showing the proportion of problems whose test instances are solved by the row model that are also solved by the column model, across four LLMs.
  • ...and 18 more figures
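
The asymmetric relative coverage matrix described in the Figure 5 caption can be computed as in the sketch below, assuming each model's solved problems are available as a set of task IDs. The function name `relative_coverage`, the model names, and the task IDs are hypothetical and carry no results from the paper. Entry (row, column) is the fraction of problems solved by the row model that the column model also solves.

```python
def relative_coverage(solved: dict[str, set[str]]) -> dict[str, dict[str, float]]:
    """Return M[row][col] = |solved[row] & solved[col]| / |solved[row]|.

    A row model that solves nothing yields NaN entries, since the
    proportion is undefined in that case.
    """
    matrix: dict[str, dict[str, float]] = {}
    for row, row_set in solved.items():
        matrix[row] = {
            col: (len(row_set & col_set) / len(row_set)) if row_set else float("nan")
            for col, col_set in solved.items()
        }
    return matrix


# Hypothetical data: model_a solves four tasks, model_b solves two of them.
solved_tasks = {
    "model_a": {"t1", "t2", "t3", "t4"},
    "model_b": {"t1", "t2"},
}
coverage = relative_coverage(solved_tasks)
```

Here `coverage["model_a"]["model_b"]` is 0.5 while `coverage["model_b"]["model_a"]` is 1.0. The matrix is asymmetric because each row is normalized by the size of the row model's own solved set.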