Learning Beyond Pattern Matching? Assaying Mathematical Understanding in LLMs

Siyuan Guo, Aniket Didolkar, Nan Rosemary Ke, Anirudh Goyal, Ferenc Huszár, Bernhard Schölkopf

TL;DR

The paper investigates whether LLMs truly understand mathematics beyond pattern-matching by evaluating learning-to-learn behavior. It introduces NTKEval, an NTK-inspired protocol that measures changes in the output distribution $p_\theta(\text{correct} \mid \text{prompt})$ as models are trained on skill-focused data, using a kernel $k(s,s')$ over math skills. Results show that in-context learning differentiates deep mathematical structures from surface formats, indicating domain understanding, whereas instruction-tuning often yields uniform, format-driven improvements. With synthetic datasets and KhanSkill across Codellama, Llemma, and Mistral models, NTKEval demonstrates sample-efficient detection of learning effects and reveals qualitative differences between ICL and IT in exploiting structure. These findings inform the design of transparent, learning-to-learn capable scientific assistants and clarify when LLMs genuinely grasp mathematical structure versus relying on surface cues.

Abstract

We are beginning to see progress in language-model-assisted scientific discovery. Motivated by the use of LLMs as general scientific assistants, this paper assesses the domain knowledge of LLMs through their understanding of the different mathematical skills required to solve problems. In particular, we look not just at what the pre-trained model already knows, but at how it learned to learn from information during in-context learning or instruction-tuning by exploiting the complex knowledge structure within mathematics. Motivated by the Neural Tangent Kernel (NTK), we propose NTKEval to assess changes in an LLM's probability distribution via training on different kinds of math data. Our systematic analysis finds evidence of domain understanding during in-context learning. By contrast, certain instruction-tuning leads to similar performance changes irrespective of training on different data, suggesting a lack of domain understanding across different skills.

Paper Structure (17 sections, 9 equations, 9 figures, 4 tables, 1 algorithm)

This paper contains 17 sections, 9 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Example questions from the synthetic dataset. Left: math skill progression from elementary to complex skills. Right: a list of presentation formats (i.e. surface structures) in which the question tests the same deep math skill.
  • Figure 2: Illustration of the sample efficiency of NTKEval for Codellama-7b. Each matrix shows either the accuracy difference or the difference in probability, evaluated on row-specific skills, between models instruction-tuned on column-specific skills and the base model. Left: Accuracy difference with $200$ generations per test question; Middle: Probability difference computed by NTKEval with $200$ generations per test question; Right: Converged accuracy difference with $5000$ generations per test question. Green indicates positive and red indicates negative values. This shows that NTKEval requires fewer generations than counting accuracy differences to capture changes in the language model via instruction-tuning.
  • Figure 3: Accuracy difference between targeted skill prompting and standard prompting when in-context examples are grouped by column-specified skills and evaluated on row-specified skills, with Codellama-7b (Left), Llemma-7b (Middle) and Mistral-7b (Right) as the base model. We observe that Llemma-7b (an LLM tailored for mathematics) displays the clearest positive diagonal line, suggesting it is good at differentiating the targeted math skill from other relevant but misleading skills in ICL. Green indicates positive and red indicates negative values.
  • Figure 4: The NTK matrix records changes in probability between the model instruction-tuned on the column-specified skill and the base model, evaluated on the row-specified skill, where the base models are Codellama-7b (Left), Llemma-7b (Middle) and Mistral-7b (Right). Green indicates positive and red indicates negative values. We observe that instruction-tuning on column-specified skill datasets displays a positive diagonal line for the majority if not all skills, confirming that training on a targeted skill improves use of that skill at test time.
  • Figure 5: Top: ICL accuracy difference on targeted (top left) and off-diagonal (top right) skill prompting compared to standard prompting; Bottom: Average difference in probability of generating correct solutions when instruction-tuning on targeted (left) and off-diagonal (right) skills compared to the base model. The x-axis displays the difficulty level of individual skills (measured by accuracy under 8-shot random in-context examples). In-context learning is able to differentiate the targeted skill from off-diagonal skills through the clear relative accuracy improvement, whereas instruction-tuning shows similar performance improvement irrespective of training skills.
  • ...and 4 more figures
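
The captions for Figures 2 to 5 describe skill-by-skill difference matrices: rows index the skill being evaluated, columns index the skill used for training or prompting, and each entry is the change relative to the base model. The following is a minimal sketch of that bookkeeping only, not the paper's NTKEval implementation. The function names and the input numbers are hypothetical, and the inputs are assumed to be pre-computed estimates of the probability (or accuracy) of a correct answer.

```python
from statistics import mean


def skill_difference_matrix(base, tuned):
    """Return diff[r][c] = tuned[r][c] - base[r].

    base[r]     : base-model score on evaluation skill r (assumed given).
    tuned[r][c] : score on evaluation skill r after training on skill c.
    """
    n = len(base)
    if n < 2:
        raise ValueError("need at least two skills to compare")
    if len(tuned) != n or any(len(row) != n for row in tuned):
        raise ValueError("tuned must be an n x n matrix matching base")
    return [[tuned[r][c] - base[r] for c in range(n)] for r in range(n)]


def diagonal_vs_off_diagonal(diff):
    """Average change on targeted (diagonal) vs. other (off-diagonal) skills."""
    n = len(diff)
    diagonal = [diff[i][i] for i in range(n)]
    off_diagonal = [diff[r][c] for r in range(n) for c in range(n) if r != c]
    return mean(diagonal), mean(off_diagonal)


if __name__ == "__main__":
    # Hypothetical numbers, chosen for illustration only (not paper results).
    base_scores = [0.2, 0.5]
    tuned_scores = [[0.6, 0.3],
                    [0.5, 0.9]]
    diff = skill_difference_matrix(base_scores, tuned_scores)
    targeted, other = diagonal_vs_off_diagonal(diff)
    print("difference matrix:", diff)
    print("mean diagonal change:", targeted)
    print("mean off-diagonal change:", other)
```

Under this reading, a diagonal average well above the off-diagonal average corresponds to the "positive diagonal line" the captions mention, while similar averages correspond to improvement that does not depend on the training skill.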