A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers

Jordan Meadows; Marco Valentino; Damien Teney; Andre Freitas

TL;DR

This work introduces a framework, driven by a symbolic engine, for generating and perturbing detailed mathematical derivations at scale, enabling controlled assessment of how well Transformers generalise in multi-step reasoning. The framework is instantiated on two sequence classification tasks and used to compare GPT-4, GPT-3.5, and fine-tuned BERT variants. In-distribution, the fine-tuned models on average surpass GPT-3.5 and rival GPT-4, but perturbations to the input reasoning can reduce their performance by up to 80 F1 points, whereas GPT-4 generalises strongly and sometimes matches or surpasses its in-distribution performance. The findings highlight a weakness shared by BERT and GPT, a relative inability to decode indirect references to mathematical entities, and suggest that training on appropriately structured derivation dependencies may bring the in-distribution performance of smaller open-source models close to that of GPT. The authors release the codebase, datasets, and fine-tuned models to encourage further work on evaluating mathematical reasoning and generalisation.

Abstract

This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may potentially rival GPT by incorporating appropriately structured derivation dependencies during training, and highlight a shared weakness between BERT and GPT involving a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field.
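
As a rough illustration of the kind of generation and perturbation the abstract describes, the sketch below derives one equation from a premise and then applies two perturbations: a change of variable surface forms and a swap of the two sides of the equality. It assumes SymPy as the symbolic engine, and the function names (`multiply_both_sides`, `rename_variables`, `swap_sides`) are hypothetical; this is not the authors' released code.

```python
import sympy as sp

x, y, a, b = sp.symbols("x y a b")

def multiply_both_sides(eq, factor):
    """One derivation step: multiply both sides of an equation by `factor`."""
    return sp.Eq(sp.expand(eq.lhs * factor), sp.expand(eq.rhs * factor))

def rename_variables(eq, mapping):
    """Perturbation (assumed form): change variable surface forms only."""
    return eq.xreplace(mapping)

def swap_sides(eq):
    """Perturbation (assumed form): use the symmetry of equality."""
    return sp.Eq(eq.rhs, eq.lhs)

# Premise and a single derived step.
premise = sp.Eq(y, x**2 + 1)
step = multiply_both_sides(premise, x)

# Perturbed variants of the same step; the mathematics is unchanged.
renamed = rename_variables(step, {x: a, y: b})
swapped = swap_sides(step)

for label, eq in [("step", step), ("renamed", renamed), ("swapped", swapped)]:
    print(label, ":", sp.latex(eq))
```

Because each perturbed equation is produced by the engine from the original step, it stays mathematically valid by construction, which is what allows a drop in model performance to be attributed to the surface change and not to a change in the underlying reasoning.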

Paper Structure (22 sections, 5 figures, 6 tables, 1 algorithm)

Related Work
Generating and Perturbing Derivations with Symbolic Engines
Premise Generation
Derivation Generation
Perturbations
Sequence Classification Tasks
Evaluation
Relating Operators to Model Generalisability via Pairwise Analysis
Conclusion
Limitations
Overall ethical impact
Chain-of-Thought
Derivation generation
Integration
Fine-tuning BERT and prompting GPT
...and 7 more sections

Figures (5)

Figure 1: We present a framework for generating and perturbing high-quality mathematical derivations at scale to systematically evaluate mathematical reasoning and generalisation in Transformers.
Figure 2: Example perturbations applied to a generated derivation using computer algebra.
Figure 3: $\tilde{N}_P$ is the percentage of operators present in examples where models fail to generalise to perturbations. The leftmost displays how this proportion varies as a function of operator rank. The rightmost graph factors in static performance $(S)$ and generalisability $(G)$ scores for a clearer comparison of models.
Figure 4: Sampled data from each binary sequence classification task. In short, a sequence containing reasoning context, an instruction annotation, and resulting math is input to a model. The model then predicts whether the math follows from the context and annotation, and if the sequence is mathematically coherent (1) or not (0).
Figure 5: Three examples of the total 15 where both SciBERT and MathBERT correctly classify unperturbed examples (as shown), but incorrectly classify all perturbed examples.
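
The input format described in the Figure 4 caption (reasoning context, instruction annotation, resulting math, and a binary label) can be sketched as a small data structure. This is a minimal illustration only: the `Example` class, the `to_sequence` helper, the `[SEP]` separator, and the sample equations are all assumptions made for the sketch and are not taken from the paper's datasets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    context: str     # prior derivation steps (the reasoning context)
    annotation: str  # instruction describing the operation to apply
    result: str      # candidate equation claimed to follow
    label: int       # 1 = mathematically coherent, 0 = not

# Assumed separator; the source does not say how the parts are joined.
SEP = " [SEP] "

def to_sequence(ex: Example) -> str:
    """Join context, annotation, and result into one classifier input string."""
    return SEP.join([ex.context, ex.annotation, ex.result])

# A coherent example and an incoherent one that share context and annotation.
positive = Example("y = x^2 + 1", "multiply both sides by x", "x y = x^3 + x", 1)
negative = Example("y = x^2 + 1", "multiply both sides by x", "x y = x^3 + 1", 0)

for ex in (positive, negative):
    print(ex.label, "|", to_sequence(ex))
```

A classifier would receive only the joined string and be scored on whether it recovers the label.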
