MacGyver: Are Large Language Models Creative Problem Solvers?

Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas L. Griffiths, Faeze Brahman

TL;DR

MacGyver introduces a large, carefully vetted dataset of 1,683 unconventional, real-world problems to probe creative problem solving in physical reasoning. It couples a progressive data-generation pipeline—driven by GPT-4 refinement and rigorous human verification—with comprehensive human and AI benchmarking across several LLMs. The study shows clear gaps between current LLMs and humans, highlights distinct error modes such as infeasible steps and tool hallucinations, and demonstrates that targeted prompting (self-reflection and divergent-convergent thinking) can meaningfully improve AI performance. The findings argue for complementary human-AI collaboration and point to future work in embodied evaluation and automatic assessment of physically grounded reasoning.

Abstract

We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically-infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.

Paper Structure (61 sections, 24 figures, 8 tables)

Figures (24)

  • Figure 1: Examples of the problems in our MacGyver dataset with the GPT-4 and human answers (continued in a later figure). Pictures, drawn by DALL·E 3, are solely for illustration purposes and may not accurately reflect the text. In our experiment, all inputs to humans and LLMs are natural language texts.
  • Figure 2: Progressive problem refinement with GPT-4. Starting from a vanilla version (i.e., Iteration 1), we carefully design refinement steps that gradually increase the problem's complexity by adding specific object properties as constraints to veto a previous solution (i.e., Iteration 2), and adding distracting objects that are (likely) not involved in the solution to the problem (i.e., Iteration 3). After that, human verifiers judge the quality of refined problems.
  • Figure 3: Affordances of the presented tools in our MacGyver dataset and their frequency (and count). Note that one object may have multiple affordances (e.g., paddle boards can be used for boating, reaching high areas, and exercise).
  • Figure 4: Left: Human-evaluated GPT-4 performance on all 1,306 problems from the MacGyver dataset that humans think are solvable. Right: GPT-4 performance on all 377 problems that humans think are unsolvable. Correct for the right reason means that the LLM correctly identifies that the problem is unsolvable and gives the right justification. Correct for the wrong reason means that it correctly identifies that the problem is unsolvable but gives an incorrect justification.
  • Figure 5: Left: Benchmark results of seven LLMs and humans with a single effort. For human participants, since no single participant worked on all problems, we take a random response for each problem. We color-code the three categories indicating fine-grained aspects of correctness or falseness. Right: Comparison between GPT-4 and humans where we evaluated multiple solutions per problem. The best performance, which can be viewed as an upper bound, is computed by taking the individual best answer (out of 6) for each problem. The actual numbers are reported in the benchmark results table in the appendix.
  • ...and 19 more figures
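
The "best performance" upper bound described in the Figure 5 caption (take the best of the 6 answers for each problem, then aggregate over problems) can be illustrated with a small sketch. This is not the authors' evaluation code: the function names, the 0/1 scoring scale, and the toy scores below are assumptions made purely for illustration.

```python
from statistics import mean
from typing import Dict, List


def best_of_k(scores_by_problem: Dict[str, List[float]]) -> float:
    """Upper bound: for each problem keep only its best-scoring answer,
    then average those best scores over all problems."""
    return mean(max(scores) for scores in scores_by_problem.values())


def average_of_k(scores_by_problem: Dict[str, List[float]]) -> float:
    """Typical performance: average all answers within a problem,
    then average over problems."""
    return mean(mean(scores) for scores in scores_by_problem.values())


# Hypothetical scores (1.0 = judged correct, 0.0 = judged incorrect),
# six answers per problem. These are made-up values, not paper results.
toy_scores = {
    "problem_a": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "problem_b": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "problem_c": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0],
}

upper_bound = best_of_k(toy_scores)
typical = average_of_k(toy_scores)
print(f"best-of-6 upper bound: {upper_bound:.3f}")
print(f"average over answers:  {typical:.3f}")
```

Because the maximum of a set of scores is never below their mean, the best-of-6 figure is always at least as high as the per-answer average, which is why the caption treats it as an upper bound rather than as expected single-attempt performance.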