ConceptSearch: Towards Efficient Program Search Using LLMs for Abstraction and Reasoning Corpus (ARC)
Kartik Singhal, Gautam Shroff
TL;DR
The paper tackles ARC's demand for broad generalization by introducing ConceptSearch, a function-search framework that uses LLMs to generate candidate DSL programs and a concept-based scoring method to guide the search. It explores three scoring functions (Hamming distance, a CNN-based score, and an LLM-based natural-language score) and reports that concept-based guidance substantially improves task-solving efficiency and accuracy on ARC tasks, achieving up to 58% success with NL-based scoring compared to a 13% baseline from direct prompting. The approach highlights the value of embedding transformation concepts and language-grounded descriptions to direct program synthesis, offering a path toward more generalizable AI reasoning. Despite these gains, ARC remains unsolved end to end, which underscores the need for richer feedback mechanisms and scalable, multi-modal guidance to bridge the gap between human-like abstraction and machine capabilities.
Abstract
The Abstraction and Reasoning Corpus (ARC) poses a significant challenge to artificial intelligence, demanding broad generalization and few-shot learning capabilities that remain elusive for current deep learning methods, including large language models (LLMs). While LLMs excel in program synthesis, their direct application to ARC yields limited success. To address this, we introduce ConceptSearch, a novel function-search algorithm that leverages LLMs for program generation and employs a concept-based scoring method to guide the search efficiently. Unlike simplistic pixel-based metrics such as Hamming distance, ConceptSearch evaluates programs on their ability to capture the underlying transformation concept reflected in the input-output examples. We explore three scoring functions: Hamming distance, a CNN-based scoring function, and an LLM-based natural language scoring function. Experimental results demonstrate the effectiveness of ConceptSearch, achieving a significant performance improvement over direct prompting with GPT-4. Moreover, our novel concept-based scoring exhibits up to 30% greater efficiency than Hamming distance, measured in terms of the number of iterations required to reach the correct solution. These findings highlight the potential of LLM-driven program search when integrated with concept-based guidance for tackling challenging generalization problems like ARC.
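To make the pixel-based baseline concrete, the following is a minimal sketch of how a Hamming-distance score might rank candidate programs against a task's input-output examples. It is an illustration under stated assumptions, not the paper's implementation: the grid representation (lists of integer rows), the normalization by cell count, the penalty of 1.0 for shape mismatches or crashing programs, and the three toy candidate functions are all hypothetical and are not taken from the paper's DSL.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)


def hamming_distance(pred: Grid, target: Grid) -> float:
    """Fraction of cells that differ; assumed to be 1.0 when shapes differ."""
    same_shape = len(pred) == len(target) and all(
        len(p) == len(t) for p, t in zip(pred, target)
    )
    if not same_shape:
        return 1.0
    total = sum(len(row) for row in target)
    if total == 0:
        return 0.0
    mismatches = sum(a != b for p, t in zip(pred, target) for a, b in zip(p, t))
    return mismatches / total


def score_program(program: Callable[[Grid], Grid], examples: List[Example]) -> float:
    """Mean Hamming distance over all examples; lower is better, 0.0 solves them all."""
    distances = []
    for grid_in, grid_out in examples:
        try:
            distances.append(hamming_distance(program(grid_in), grid_out))
        except Exception:
            # Assumption: a program that crashes gets the worst possible score.
            distances.append(1.0)
    return sum(distances) / len(distances)


# Hypothetical candidates standing in for LLM-generated DSL programs.
def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return [row[:] for row in reversed(grid)]


def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


# Toy task: every output is the input mirrored left to right.
examples: List[Example] = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]

candidates: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_vertical": flip_vertical,
    "flip_horizontal": flip_horizontal,
}

scores = {name: score_program(fn, examples) for name, fn in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy task, `identity` and `flip_vertical` receive the same score even though `flip_vertical` is conceptually closer to the correct mirroring transformation. A pixel-level metric cannot express that difference, which is the kind of limitation the concept-based scoring functions described above are meant to address.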
