Case-Based or Rule-Based: How Do Transformers Do the Math?

Yi Hu; Xiaojuan Tang; Haotong Yang; Muhan Zhang

TL;DR

The paper proposes Rule-Following Fine-Tuning (RFFT), a technique that teaches transformers to perform rule-based reasoning. With RFFT, LLMs fine-tuned on 1-5 digit addition generalize to addition of up to 12 digits with over 95% accuracy, which is over 40% higher than scratchpad.

Abstract

Despite their impressive performance on a variety of complex tasks, modern large language models (LLMs) still have trouble with some math problems that are simple and intuitive for humans, such as addition. While humans can easily learn the basic rules of addition and apply them to new problems of any length, LLMs struggle to do the same. Instead, they may rely on similar cases seen in the training corpus. We define these two reasoning mechanisms as "rule-based reasoning" and "case-based reasoning". Since rule-based reasoning is essential for acquiring systematic generalization ability, we aim to determine whether transformers use rule-based or case-based reasoning for math problems. Through carefully designed intervention experiments on five math tasks, we confirm that transformers perform case-based reasoning, regardless of whether scratchpad is used, which aligns with previous observations that transformers use subgraph matching/shortcut learning to reason. To mitigate this problem, we propose a Rule-Following Fine-Tuning (RFFT) technique to teach transformers to perform rule-based reasoning. Specifically, we provide explicit rules in the input and then instruct transformers to recite and follow the rules step by step. Through RFFT, we successfully enable LLMs fine-tuned on 1-5 digit addition to generalize to up to 12-digit addition with over 95% accuracy, which is over 40% higher than scratchpad. This significant improvement demonstrates that teaching LLMs to use rules explicitly helps them learn rule-based reasoning and generalize better in length.
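To make the idea of following rules step by step concrete, here is a minimal Python sketch of digit-by-digit addition with an explicit carry, which logs each applied rule as a line of text. The function name `add_with_trace` and the wording of the trace lines are assumptions made for illustration; this is not the prompt or output format used in the paper.

```python
def add_with_trace(a: str, b: str) -> tuple[str, list[str]]:
    """Add two non-negative decimal strings digit by digit.

    Returns the sum as a string plus a trace with one line per applied rule.
    The trace wording is hypothetical, not the paper's RFFT format.
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter operand with zeros
    carry = 0
    digits = []  # result digits, least significant first
    trace = []
    for i in range(width - 1, -1, -1):  # walk from the rightmost digit
        total = int(a[i]) + int(b[i]) + carry
        digit, new_carry = total % 10, total // 10
        trace.append(
            f"{a[i]} + {b[i]} + carry {carry} = {total}: "
            f"write {digit}, carry {new_carry}"
        )
        digits.append(str(digit))
        carry = new_carry
    if carry:
        digits.append(str(carry))
        trace.append(f"final carry {carry}: write {carry}")
    return "".join(reversed(digits)), trace


if __name__ == "__main__":
    result, steps = add_with_trace("758", "64")
    print(result)
    for line in steps:
        print(line)
```

Because the same per-digit rule is applied at every position, the procedure works unchanged for operands of any length, which is the property the abstract associates with rule-based reasoning.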

Paper Structure (75 sections, 1 equation, 30 figures, 20 tables)

Introduction
Related Work
LLM reasoning.
Memorization or generalization.
Grokking.
Theoretical expressiveness.
Length generalization.
Case-based and Rule-based Reasoning
Case-based Reasoning.
Rule-based Reasoning.
Transformers are Doing Case-based Reasoning
Experimental Setup
Datasets
Models
Method
...and 60 more sections

Figures (30)

Figure 1: Illustrations of case-based and rule-based reasoning.
Figure 2: Accuracy of Leave-Square-Out method on addition, modular addition, base addition, and linear regression. The vertical and horizontal axes are $a$ and $b$, respectively. The area inside red boxes represents the test squares. During generation, we set the model temperature to 1 and sample 10 generations to evaluate the accuracy on each test point. We only leave one test square out in this experiment. The square center $(a_k, b_k)$ is (50, 50) for addition, base addition and linear regression and (56, 56) for modular addition.
Figure 3: We randomly select 3 centers of test squares $(a_k, b_k)$ and corresponding lengths $l_k$ ranging from 20 to 40 to see whether the locations and the side lengths affect the case-based reasoning behavior for datasets including addition, modular addition, base addition and linear regression. The area inside red boxes represents the test squares. We sample 10 generations at each data point and report the accuracy. The figure shows that holes consistently appear with the locations and side lengths of the test squares varying.
Figure 4: Test accuracy distribution of GPT-2 trained with scratchpad in the task of addition. Note that all points in the figure are test samples; each subfigure here corresponds to a left-out square in the original plane. From left to right, the side length of test square is set to $l_k=10, 20, 30, 40$. For each test point, we sample 10 generations and show the accuracy of generating the correct answer.
Figure 5: In the task of addition, we show the average accuracy over all test samples (samples within the square) with side length $l_k=10, 20, 30, 40$. We test four models: GPT-2, GPT-2 with scratchpad, GPT-2-medium and GPT-2-medium with scratchpad.
...and 25 more figures
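The Leave-Square-Out captions above describe holding out a square of $(a, b)$ pairs around a center $(a_k, b_k)$ as the test set and training on the rest. A minimal Python sketch of such a split follows. The operand range of 0 to 99, the inclusive boundary convention, and the function name `leave_square_out` are assumptions made for illustration; the captions do not specify them.

```python
def leave_square_out(center, side, lo=0, hi=100):
    """Split all (a, b) pairs with lo <= a, b < hi into train and test sets.

    Test points lie inside the square of side length `side` centered at
    `center`. The operand range and the inclusive boundary are assumptions.
    """
    a_k, b_k = center
    half = side / 2
    train, test = [], []
    for a in range(lo, hi):
        for b in range(lo, hi):
            inside = abs(a - a_k) <= half and abs(b - b_k) <= half
            (test if inside else train).append((a, b))
    return train, test


if __name__ == "__main__":
    # Center (50, 50) as in the Figure 2 caption; side length 20 is one of
    # the values mentioned in the captions.
    train, test = leave_square_out(center=(50, 50), side=20)
    print(len(train), len(test))
```

Under this setup, a model that reasons by rules should score well inside the held-out square, whereas a model that depends on nearby training cases would show a "hole" in accuracy there, which is the pattern the Figure 3 caption reports.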
