Reverse That Number! Decoding Order Matters in Arithmetic Learning

Daniel Zhang-Li, Nianyi Lin, Jifan Yu, Zheyuan Zhang, Zijun Yao, Xiaokang Zhang, Lei Hou, Jing Zhang, Juanzi Li

TL;DR

This work introduces a novel strategy that not only reevaluates the digit order by prioritizing output from the least significant digit but also incorporates a step-by-step methodology to substantially reduce complexity.

Abstract

Recent advancements in pretraining have demonstrated that modern Large Language Models (LLMs) can effectively learn arithmetic operations. However, despite acknowledging the significance of digit order in arithmetic computation, current methodologies predominantly rely on sequential, step-by-step approaches for teaching LLMs arithmetic, leading to the conclusion that better performance requires fine-grained step-by-step processes. Diverging from this conventional path, our work introduces a novel strategy that not only reevaluates the digit order by prioritizing output from the least significant digit but also incorporates a step-by-step methodology to substantially reduce complexity. We have developed and applied this method in a comprehensive set of experiments. Compared to the previous state-of-the-art (SOTA) method, our findings reveal an overall improvement in accuracy while requiring only a third of the tokens typically used during training. To facilitate replication and further research, we have made our code and dataset publicly available at https://anonymous.4open.science/r/RAIT-9FB7/.

Paper Structure (33 sections, 6 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Reversing the numbers in training enables models to better learn to do arithmetic operations.
  • Figure 2: Example training data for multiplication, where the task is solved using a step-by-step process. During the $i$th intermediate step, the intermediate product is computed first. Then, inspired by the human process, we leave the least significant digits ($U_{low}$) unchanged and add the product directly to the remaining digits ($U_{high}$) of the cumulative sum. Finally, we pop the least significant digit from the updated $U_{high}$ and append it to $U_{low}$, since it will not be added to non-zero digits in later steps. During decoding, we express all numbers in Little-Endian, where the least significant digit comes first. We convert all numbers back to Big-Endian before printing.
  • Figure 3: Performance when integrating the step-by-step process. BE stands for Big-Endian and LE stands for Little-Endian. The left graph shows results after training on addition; the right graph shows results for models trained and evaluated on subtraction.
  • Figure 4: Visualization of attention weights during inference, with rows representing output tokens and columns indicating the input tokens involved in generation. Attention weights are square-root transformed to make correlations more visible. The attention on the left (layer $14$) reveals that output digits correlate with their corresponding inputs, while the attention on the right (layer $22$) suggests reconstruction of carry information.
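
The multiplication procedure described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' released code: the function names are hypothetical, $U_{high}$ is kept as a plain integer rather than a token sequence, and the actual training-data format may differ. Here $U_{low}$ holds the digits that are already final and $U_{high}$ holds the remaining part of the cumulative sum.

```python
def to_little_endian(n: int) -> str:
    """Write a non-negative integer with its least significant digit first."""
    return str(n)[::-1]


def from_little_endian(s: str) -> int:
    """Convert a Little-Endian digit string back to an integer."""
    return int(s[::-1])


def multiply_step_by_step(a: int, b: int):
    """Multiply two non-negative integers one digit of b at a time.

    Hypothetical helper for illustration only. Returns the product and a
    trace of (intermediate product, U_low, U_high) per step, with every
    number in the trace written in Little-Endian.
    """
    u_low = ""   # finalized low-order digits, least significant first
    u_high = 0   # remaining part of the cumulative sum (assumption: kept as an int)
    trace = []
    for digit_char in to_little_endian(b):  # least significant digit of b first
        partial = a * int(digit_char)       # intermediate product for this step
        u_high += partial                   # add only to the remaining digits
        u_low += str(u_high % 10)           # pop the lowest digit of U_high ...
        u_high //= 10                       # ... and append it to U_low
        trace.append((to_little_endian(partial), u_low, to_little_endian(u_high)))
    # In Little-Endian order the finalized low digits come before the rest.
    result_le = u_low + (to_little_endian(u_high) if u_high else "")
    return from_little_endian(result_le), trace


if __name__ == "__main__":
    product, steps = multiply_step_by_step(12, 34)
    for partial, low, high in steps:
        print(f"partial={partial} U_low={low} U_high={high}")
    print(product)
```

For 12 × 34, the first step computes the intermediate product 48 (written `84` in Little-Endian), moves its lowest digit 8 into $U_{low}$, and keeps 4 in $U_{high}$. The second step adds 36 to that 4, moves the 0 into $U_{low}$, and leaves 4 in $U_{high}$, so the Little-Endian result is `804`, which converts back to 408.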