ConceptSearch: Towards Efficient Program Search Using LLMs for Abstraction and Reasoning Corpus (ARC)

Kartik Singhal, Gautam Shroff

TL;DR

The paper tackles ARC's demand for broad generalization by introducing ConceptSearch, a function-search framework that uses LLMs to generate candidate DSL programs and a concept-based scoring function to guide the search. It explores three scoring functions (Hamming distance, a CNN-based score, and an LLM-based natural-language score) and reports that concept-based guidance improves both task-solving efficiency and accuracy on ARC tasks, achieving up to 58% success with natural-language scoring compared to a 13% baseline from direct prompting. The approach highlights the value of embedding transformation concepts and language-grounded descriptions to direct program synthesis, offering a path toward more generalizable AI reasoning. Despite these gains, ARC remains unsolved end-to-end, which points to the need for richer feedback mechanisms and scalable, multi-modal guidance to bridge the gap between human-like abstraction and machine capabilities.

Abstract

The Abstraction and Reasoning Corpus (ARC) poses a significant challenge to artificial intelligence, demanding broad generalization and few-shot learning capabilities that remain elusive for current deep learning methods, including large language models (LLMs). While LLMs excel in program synthesis, their direct application to ARC yields limited success. To address this, we introduce ConceptSearch, a novel function-search algorithm that leverages LLMs for program generation and employs a concept-based scoring method to guide the search efficiently. Unlike simplistic pixel-based metrics like Hamming distance, ConceptSearch evaluates programs on their ability to capture the underlying transformation concept reflected in the input-output examples. We explore three scoring functions: Hamming distance, a CNN-based scoring function, and an LLM-based natural language scoring function. Experimental results demonstrate the effectiveness of ConceptSearch, achieving a significant performance improvement over direct prompting with GPT-4. Moreover, our novel concept-based scoring exhibits up to 30% greater efficiency compared to Hamming distance, measured in terms of the number of iterations required to reach the correct solution. These findings highlight the potential of LLM-driven program search when integrated with concept-based guidance for tackling challenging generalization problems like ARC.
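To make the pixel-based baseline concrete, the following is a minimal sketch of a Hamming-distance score between a program's output grid and the target grid. It is not taken from the paper: the function names `hamming_distance` and `mean_hamming`, the normalisation by cell count, and the choice to return the maximum distance of 1.0 on a shape mismatch are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[int]]  # an ARC grid as rows of colour indices


def hamming_distance(predicted: Grid, target: Grid) -> float:
    """Fraction of cells that differ between two grids.

    Assumption: grids of different shapes get the maximum distance 1.0.
    """
    same_shape = len(predicted) == len(target) and all(
        len(p_row) == len(t_row) for p_row, t_row in zip(predicted, target)
    )
    if not same_shape:
        return 1.0
    total_cells = sum(len(row) for row in target)
    if total_cells == 0:
        return 0.0
    mismatches = sum(
        p != t
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    )
    return mismatches / total_cells


def mean_hamming(outputs: List[Grid], targets: List[Grid]) -> float:
    """Average distance over all input-output examples (lower is better)."""
    if not targets or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and equal in length")
    return sum(hamming_distance(o, t) for o, t in zip(outputs, targets)) / len(targets)
```

A score of this kind only counts matching cells, so a program that applies the right transformation but produces a grid of the wrong size is rated as badly as a program that does nothing useful. That is the limitation the concept-based scoring functions are meant to address.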

Paper Structure

This paper contains 9 sections, 3 equations, 7 figures, 2 tables, 2 algorithms.

Figures (7)

  • Figure 1: Three sample ARC tasks that are easily solvable by humans yet remain unsolved by both our proposed method and the GPT-4 baseline (Xu et al., 2024).
  • Figure 2: Flowchart of the function-search algorithm, illustrating how programs in the program database $P$ are evolved using the scoring function $S$ in the context of the Abstraction and Reasoning Corpus.
  • Figure 3: Compact version of the prompt used in the program-generation step, with two in-context program examples.
  • Figure 4: Model architecture trained with classification loss and contrastive loss for learning meaningful task representations. The grid size can be as small as 1$\times$1, and cell occupancy is one-hot encoded into 10 channels, denoting 9 different colours and one for the cell being empty.
  • Figure 5: Comparing CNN-based and LLM-based scoring: the former extracts features with a CNN, while the latter uses an LLM to produce a natural-language hypothesis, which is then converted into a feature embedding using SentenceTransformer.
  • ...and 2 more figures
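
The input encoding described in the Figure 4 caption (cell occupancy one-hot encoded into 10 channels) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `one_hot_grid` is hypothetical, and the mapping of cell value to channel index (including which channel stands for an empty cell) is assumed to be the cell's integer value.

```python
from typing import List

Grid = List[List[int]]


def one_hot_grid(grid: Grid, num_channels: int = 10) -> List[List[List[int]]]:
    """Encode a grid as a [channel][row][column] nested list of 0/1 values.

    Assumption: a cell with integer value v sets channel v to 1.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    encoded = [
        [[0] * width for _ in range(height)] for _ in range(num_channels)
    ]
    for r, row in enumerate(grid):
        if len(row) != width:
            raise ValueError("grid rows must all have the same length")
        for c, value in enumerate(row):
            if not 0 <= value < num_channels:
                raise ValueError(f"cell value {value} is outside 0..{num_channels - 1}")
            encoded[value][r][c] = 1
    return encoded
```

Because exactly one channel is set per cell, the encoding works for any grid size, including the 1$\times$1 case mentioned in the caption.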