PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

Souradeep Mukhopadhyay, Rishabh Baral, Nimeesh Mahajan, Samhitha Harish, Aswin RRV, Mihir Parmar, Mutsumi Nakamura, Chitta Baral

TL;DR

PHANTOM RECALL systematically probes whether LLMs truly re-reason when puzzle constraints shift, using a curated suite of 25 base logic puzzles and 149 perturbations. The authors introduce automated and manual diagnostics (a conceptual-equivalence judge, a detailed reasoning-error taxonomy, and a Step Error Classification auto-evaluator), along with a mirror-image dataset to test reasoning under restricted answer spaces. Evaluations across 11 models reveal a persistent phantom recall phenomenon: near-perfect base-puzzle accuracy collapses under perturbation, with models often reusing memorized solutions or spurious rationales. Prompt-based mitigation improves performance but does not eliminate fragility. This highlights a gap between linguistic fluency and genuine logical understanding and motivates future work on context-grounded reasoning.

Abstract

Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles, but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we ask: Have LLMs addressed these issues, and to what extent? What about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these questions systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode, phantom recall, in which models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift, highlighting the gap between linguistic fluency and logical understanding.
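
The comparison the abstract describes (accuracy on unmodified puzzles versus accuracy on their perturbations) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Result` record, the `accuracy_gap` helper, the puzzle identifiers, and the toy records are all hypothetical, and the correctness flags are placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    puzzle_id: str   # hypothetical identifier of the base puzzle
    perturbed: bool  # False for the original puzzle, True for a variant
    correct: bool    # whether the model's answer was judged correct


def accuracy(results):
    """Fraction of correct results; None when there is nothing to score."""
    results = list(results)
    if not results:
        return None
    return sum(r.correct for r in results) / len(results)


def accuracy_gap(results):
    """Return (base accuracy, perturbed accuracy, base minus perturbed)."""
    base = accuracy(r for r in results if not r.perturbed)
    pert = accuracy(r for r in results if r.perturbed)
    gap = None if base is None or pert is None else base - pert
    return base, pert, gap


# Toy placeholder records, NOT results from the paper.
toy = [
    Result("puzzle_a", False, True),
    Result("puzzle_a", True, False),
    Result("puzzle_a", True, True),
    Result("puzzle_b", False, True),
    Result("puzzle_b", True, False),
    Result("puzzle_b", True, False),
]

base, pert, gap = accuracy_gap(toy)
print(f"base={base:.2f} perturbed={pert:.2f} gap={gap:.2f}")
```

A large positive gap on real data would correspond to the collapse under perturbation that the abstract reports. The sketch assumes each answer has already been judged correct or incorrect, for instance by a human or an automated judge.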

Paper Structure

This paper contains 38 sections, 21 figures, 2 tables.

Figures (21)

  • Figure 1: LLMs stumble on this trivial puzzle variant, but a human can solve it instantly.
  • Figure 2: Distribution of standard puzzles and variations.
  • Figure 3: Grouped bar chart: automatic vs. human accuracy on closed-source models.
  • Figure 4: Overview of our three major contributions and their interconnections. The benchmark enables systematic reasoning chain analysis, which informs the development of error mitigation strategies. These strategies are validated on the benchmark, creating a comprehensive framework.
  • Figure 5: Performance of five different open-source LLMs (Llama 3.1, Phi 4, Mistral 7B, Qwen 2.5 7B, and InternLM) in terms of accuracy on the Phantom Recall dataset.
  • ...and 16 more figures