The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

Abstract

Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the ``car wash problem'' across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) -- 500 instances spanning 4 heuristic by 5 constraint families with minimal pairs and explicitness gradients -- demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
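The abstract's "strict evaluation (10/10 correct)" criterion can be made concrete with a small sketch. This is an illustrative reconstruction, not the benchmark's released scoring code; the function name and the sample data are hypothetical.

```python
# Hedged sketch of the strict (10/10) scoring criterion described in the
# abstract: an instance counts as solved only if all 10 sampled answers
# for that instance are judged correct. Data below is made up.

def strict_accuracy(results_per_instance):
    """Fraction of instances where every one of the 10 runs is correct."""
    solved = sum(1 for runs in results_per_instance if all(runs))
    return solved / len(results_per_instance)

# Hypothetical judgments: 3 instances, 10 boolean verdicts each.
base = [[True] * 10, [True] * 9 + [False], [False] * 10]
print(strict_accuracy(base))  # only the first instance survives the strict bar
```

Under this criterion a model that is right 9 times out of 10 on an instance still scores zero on it, which is why strict accuracy sits well below per-sample accuracy.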

Paper Structure

This paper contains 67 sections, 2 equations, 15 figures, and 13 tables.

Figures (15)

  • Figure 1: Left: Base decision scores $s(x)$. All positive (incorrect Walk preference); non-monotonic scaling. Right: Span-level occlusion heatmap. Distance columns uniformly blue ($\Delta s < 0$, toward Drive); goal columns near-zero or red.
  • Figure 2: Left: CSI vs. DSI per paraphrase (Qwen3-4B). Goal sensitivity drives HDR variation; distance sensitivity is stable. Right: Per-span $\Delta s$ heatmap (Qwen3-4B). Pattern consistent across all six models.
  • Figure 3: All six models' conflict curves (solid) are sigmoids tracking the control (dashed gray). No flat curve appears. Details in Appendix \ref{app:study1}.
  • Figure 4: Mean strict accuracy per H $\times$ C cell (14 models). C-pres hardest; C-cap easiest. Cells marked "---" are unpopulated (6 of 20 cells excluded for lack of natural scenarios; see Table \ref{tab:hob_taxonomy}).
  • Figure 5: Probe pattern classification (6 models $\times$ 4 probes). Correct: curves distinct; Partial: weak separation; Fail: sigmoid failure ($r > 0.8$).
  • ...and 10 more figures
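Figure 1's right panel relies on span-level occlusion: mask one span of the prompt at a time and record how the decision score $s(x)$ shifts. The sketch below illustrates the mechanics only; the `score` function is a toy stand-in that counts cue words (the paper's actual $s(x)$ comes from model logits), and all names and example text are hypothetical.

```python
# Hedged sketch of span-level occlusion attribution. A negative delta means
# masking that span pushes the score toward Drive, mirroring the blue
# distance columns in Figure 1 (right). Toy scorer, not the paper's s(x).

def score(text):
    # Stand-in decision score: distance cues push toward Walk (+),
    # constraint cues push toward Drive (-). Purely illustrative.
    return text.lower().count("blocks") - text.lower().count("soap")

def occlusion_deltas(spans, mask="[MASK]"):
    """Return Delta s per span: score with the span masked, minus base score."""
    base = score(" ".join(spans))
    deltas = []
    for i in range(len(spans)):
        occluded = spans[:i] + [mask] + spans[i + 1:]
        deltas.append(score(" ".join(occluded)) - base)
    return deltas

spans = ["The car wash is two blocks away.",
         "You are out of soap.",
         "Should you walk?"]
print(occlusion_deltas(spans))  # one delta per span
```

A heatmap of these per-span deltas across many prompts is exactly the kind of view Figures 1 and 2 report, with consistently signed distance columns indicating a dominant surface cue.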