MacGyver: Are Large Language Models Creative Problem Solvers?
Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas L. Griffiths, Faeze Brahman
TL;DR
MacGyver introduces a large, carefully vetted dataset of 1,683 unconventional, real-world problems to probe creative problem solving in physical reasoning. It couples a progressive data-generation pipeline, driven by GPT-4 refinement and rigorous human verification, with comprehensive benchmarking of humans and several LLMs. The study shows clear gaps between current LLMs and humans, highlights distinct error modes such as infeasible steps and tool hallucinations, and demonstrates that targeted prompting (self-reflection and divergent-convergent thinking) can meaningfully improve AI performance. The findings argue for complementary human-AI collaboration and point to future work in embodied evaluation and automatic assessment of physically grounded reasoning.
Abstract
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.
