Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors

Dane Williamson, Yangfeng Ji, Matthew Dwyer

TL;DR

This work identifies syntactic misalignment as a rational, schema-driven failure mode in LLM math reasoning, where unfamiliar surface structure triggers overreliance on learned templates. By quantifying syntactic complexity with Dependency Locality Theory (DLT) and normalizing it to $\mathrm{DLT}_{\text{norm}}(q)$, the authors show that higher syntactic burden predicts failures on several benchmarks. They propose a dependency-guided rephrasing pipeline that aligns incorrectly answered questions with structurally similar, correctly answered ones using a Weisfeiler-Lehman graph kernel, reducing DLT-based processing cost and improving accuracy across GSM8K, SVAMP, and other datasets without retraining. The findings suggest that syntax-aware interventions, including rephrasing and curriculum-style exposure to varied structures, can meaningfully boost robustness and generalization in math reasoning tasks. This work offers a principled, cognitive-science–inspired framework for diagnosing and mitigating a substantial source of inductive failure in current LLMs.

Abstract

Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
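
To make the idea of a dependency-based complexity score concrete, here is a minimal sketch of a length-based proxy. It is not the paper's metric: DLT proper weighs the discourse referents that intervene between a head and its dependent, whereas this sketch simply sums token distances. The function names, the head-index encoding, and the normalization by token count are all assumptions made for illustration.

```python
from typing import List


def dependency_length_cost(heads: List[int]) -> int:
    """Sum of linear distances between each token and its syntactic head.

    heads[i] is the 0-based index of token i's head, or -1 for the root.
    This is a simplified stand-in for a DLT integration cost (assumption).
    """
    return sum(abs(i - h) for i, h in enumerate(heads) if h >= 0)


def normalized_cost(heads: List[int]) -> float:
    """Cost divided by sentence length (hypothetical normalization)."""
    n = len(heads)
    return dependency_length_cost(heads) / n if n else 0.0


# "Tom has three apples": Tom->has, has=root, three->apples, apples->has
simple = [1, -1, 3, 1]
# "Apples Tom bought were red": Apples->red, Tom->bought,
# bought->Apples, were->red, red=root
nested = [4, 2, 0, 4, -1]

print(normalized_cost(simple))  # 1.0
print(normalized_cost(nested))  # 1.6
```

Under this proxy the reduced relative clause scores higher than the plain declarative, which is the direction of effect the abstract describes. In practice the head indices would come from a dependency parser rather than being written by hand.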

Paper Structure

This paper contains 32 sections, 6 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Structural rephrasing improves model accuracy by reducing syntactic complexity and dependency length.
  • Figure 2: Dependency parses illustrating the rephrasing pipeline. The rephrased version reduces dependency depth and referential interference, lowering DLT-based processing cost.
  • Figure 3: Format of rephrasing prompt. The LLM is prompted to generate a rephrased variant that more closely matches the surface structure of the correctly answered question.
  • Figure 4: Rephrasing pipeline. An incorrectly answered question is aligned to a syntactically similar, correctly answered one via WL Kernel matching. A $k$-shot prompt then guides the LLM to generate a syntactically aligned but semantically identical rephrasing.
  • Figure 5: DLT complexity scores by model outcome (correct vs. incorrect) across five LLMs on GSM8K. In each subplot, incorrectly answered questions (orange) exhibit higher mean complexity and greater variance than correct ones (green). Welch's t-statistics and p-values confirm these differences are statistically significant.
  • ...and 4 more figures
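
The matching step named in the Figure 4 caption can be sketched as follows. This is a minimal, pure-Python illustration of a Weisfeiler-Lehman subtree kernel used to pick the most structurally similar candidate, not the authors' implementation. The graph encoding (part-of-speech tags as node labels, dependency arcs treated as undirected edges), the iteration count, the cosine normalization, and all function names are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple

# A graph is (node labels, undirected edges between node indices).
Graph = Tuple[List[str], List[Tuple[int, int]]]


def wl_features(labels: List[str], edges: List[Tuple[int, int]],
                iterations: int = 2) -> Counter:
    """Count node labels over successive WL relabeling rounds."""
    adj = {i: [] for i in range(len(labels))}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    current = list(labels)
    feats = Counter(current)
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbor labels.
        current = [
            current[i] + "(" + ",".join(sorted(current[j] for j in adj[i])) + ")"
            for i in range(len(current))
        ]
        feats.update(current)
    return feats


def wl_kernel(g1: Graph, g2: Graph, iterations: int = 2) -> int:
    """Dot product of the two graphs' WL feature counts."""
    f1 = wl_features(g1[0], g1[1], iterations)
    f2 = wl_features(g2[0], g2[1], iterations)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())


def wl_similarity(g1: Graph, g2: Graph, iterations: int = 2) -> float:
    """Kernel value normalized to [0, 1] (cosine normalization)."""
    k11 = wl_kernel(g1, g1, iterations)
    k22 = wl_kernel(g2, g2, iterations)
    if k11 == 0 or k22 == 0:
        return 0.0
    return wl_kernel(g1, g2, iterations) / math.sqrt(k11 * k22)


def best_match(query: Graph, candidates: List[Graph],
               iterations: int = 2) -> int:
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)),
               key=lambda i: wl_similarity(query, candidates[i], iterations))


# Hypothetical parses: an incorrectly answered question and two correct ones.
query = (["VERB", "NOUN", "NUM"], [(0, 1), (1, 2)])
cand_a = (["VERB", "NOUN", "VERB", "NOUN"], [(0, 1), (1, 2), (2, 3)])
cand_b = (["VERB", "NOUN", "NUM"], [(0, 1), (1, 2)])
print(best_match(query, [cand_a, cand_b]))  # 1
```

The selected candidate would then serve as the structural template in the rephrasing prompt. A production version would likely hash the relabeled strings and keep edge direction and dependency relation labels, both of which this sketch omits for brevity.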