Distilling Mathematical Reasoning Capabilities into Small Language Models

Xunyu Zhu; Jian Li; Yong Liu; Can Ma; Weiping Wang

TL;DR

This work aims to make mathematical reasoning more widely accessible by distilling the capabilities of Large Language Models (LLMs) into sub-billion parameter Small Language Models (SLMs). It introduces Equation-of-Thought Distillation (EoTD), which encodes reasoning as equations solved by an external solver, and Ensemble Thoughts Distillation (ETD), which combines Chain-of-Thought (CoT), Program-of-Thought (PoT), and Equation-of-Thought (EoT) rationales into a diverse, multi-form reasoning dataset for fine-tuning. Empirical results on GSM8K, ASDiv, SVAMP, and MultiArith show that EoTD significantly improves SLM reasoning, while ETD delivers state-of-the-art performance across model sizes, with larger SLMs benefiting more from diverse thought forms. The approach offers a pathway to deploying capable mathematical reasoning tools on resource-constrained hardware, with potential extensions beyond mathematics to broader reasoning tasks.

Abstract

This work addresses the challenge of democratizing advanced Large Language Models (LLMs) by compressing their mathematical reasoning capabilities into sub-billion parameter Small Language Models (SLMs) without compromising performance. We introduce Equation-of-Thought Distillation (EoTD), a novel technique that encapsulates the reasoning process into equation-based representations to construct an EoTD dataset for fine-tuning SLMs. Additionally, we propose the Ensemble Thoughts Distillation (ETD) framework to enhance the reasoning performance of SLMs. This involves creating a reasoning dataset with multiple thought processes, including Chain-of-Thought (CoT), Program-of-Thought (PoT), and Equation-of-Thought (EoT), and using it for fine-tuning. Our experimental results demonstrate that EoTD significantly boosts the reasoning abilities of SLMs, while ETD enables these models to achieve state-of-the-art reasoning performance.
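To make the Equation-of-Thought idea concrete, the following is a minimal sketch of how an equation-based rationale might be handed to a deterministic solver and kept only if it solves cleanly and matches the reference answer. It is not the authors' implementation: the word problem, the `Linear` helper class, the `solve_equations` and `keep_sample` functions, and the convention that the final answer lives in a variable named `ans` are all assumptions made for illustration, and only the Python standard library is used.

```python
from fractions import Fraction
import re


class Linear:
    """Linear expression: const + sum(coeff * variable). Hypothetical helper."""

    def __init__(self, coeffs=None, const=0):
        self.coeffs = dict(coeffs or {})
        self.const = Fraction(const)

    @staticmethod
    def lift(value):
        return value if isinstance(value, Linear) else Linear(const=value)

    def __add__(self, other):
        other = Linear.lift(other)
        coeffs = dict(self.coeffs)
        for name, c in other.coeffs.items():
            coeffs[name] = coeffs.get(name, 0) + c
        return Linear(coeffs, self.const + other.const)

    __radd__ = __add__

    def __mul__(self, other):
        other = Linear.lift(other)
        if self.coeffs and other.coeffs:
            raise ValueError("non-linear term")
        scalar, expr = (other.const, self) if self.coeffs else (self.const, other)
        return Linear({n: c * scalar for n, c in expr.coeffs.items()},
                      expr.const * scalar)

    __rmul__ = __mul__

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-Linear.lift(other))

    def __rsub__(self, other):
        return Linear.lift(other) - self


def solve_equations(equations):
    """Solve a system of linear equations given as strings like '2 * c + 4 * r = 28'."""
    names = sorted(set(re.findall(r"[A-Za-z_]\w*", " ".join(equations))))
    env = {n: Linear({n: 1}) for n in names}
    rows = []
    for eq in equations:
        lhs, rhs = eq.split("=")
        # eval is used only to parse arithmetic here; it is NOT a security sandbox.
        diff = (Linear.lift(eval(lhs.strip(), {"__builtins__": {}}, env))
                - Linear.lift(eval(rhs.strip(), {"__builtins__": {}}, env)))
        rows.append([Fraction(diff.coeffs.get(n, 0)) for n in names] + [-diff.const])
    # Gauss-Jordan elimination with exact rational arithmetic.
    pivot_row = 0
    for col in range(len(names)):
        pivot = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system is underdetermined")
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        p = rows[pivot_row][col]
        rows[pivot_row] = [v / p for v in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    if any(row[-1] != 0 for row in rows[pivot_row:]):
        raise ValueError("system is inconsistent")
    return {n: rows[i][-1] for i, n in enumerate(names)}


def keep_sample(equations, gold_answer, answer_var="ans"):
    """Keep an EoT sample only if it solves without error and matches the gold answer."""
    try:
        solution = solve_equations(equations)
    except Exception:
        return False
    return solution.get(answer_var) == Fraction(gold_answer)


# Illustrative question: 10 animals (chickens c, rabbits r) have 28 legs; how many rabbits?
eot = ["c + r = 10", "2 * c + 4 * r = 28", "ans = r"]
print(solve_equations(eot)["ans"], keep_sample(eot, 4))
```

The appeal of this form is that the model only has to write down the relationships between quantities; the arithmetic is delegated to the solver, so a rationale can be checked mechanically against the reference answer before it enters the training set.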

Paper Structure (26 sections, 7 equations, 9 figures, 3 tables)

This paper contains 26 sections, 7 equations, 9 figures, 3 tables.

Introduction
Related Work
Large Language Models (LLMs)
Mathematical Reasoning
Knowledge Distillation
Methodology
Equation-of-Thought Distillation
Data Generation from LLMs
Fine-tuning SLMs
Ensemble Thoughts Distillation
Experiments
Dataset
Implementation
Baselines
Main Results
...and 11 more sections

Figures (9)

Figure 1: A particular case where SLMs under CoTD and PoTD fail to generate the correct answer, but SLMs under EoTD successfully solve the question.
Figure 2: Detailed data generation of our framework. First, we manually construct several contextualized examples and combine them with the question and the prompt "System of linear equations: (Do not simplify)" to prompt LLMs to generate an EoT for the question. The resulting equation system is sent to a deterministic equation solver; if it raises compile errors or produces a wrong answer, we drop the EoT. The result is a high-quality reasoning dataset.
Figure 3: Detailed overview of Ensemble Thoughts Distillation. First, we combine a CoT dataset, a PoT dataset, and an EoT dataset to build a new ETD dataset with diverse thoughts and prompts. Then, we use the ETD dataset to fine-tune SLMs. After fine-tuning, we use the prompt "System of linear equations: (Do not simplify)" to instruct SLMs to generate equations, the prompt "Let's break down the code step by step" to instruct SLMs to generate programs, and the prompt "Let's think step by step" to instruct SLMs to generate chains to solve questions.
Figure 4: Effect of ETD. We fine-tune SLMs on the ETD dataset, the CoTD dataset, the PoTD dataset, and the EoTD dataset to study the effect of ETD. The experimental results show that ETD improves the reasoning performance of SLMs under different thoughts.
Figure 5: Effect of Data Scale. We fine-tune CodeT5-Base on different data sizes to evaluate the effect of data scale. The experimental results show that larger data sizes give SLMs better reasoning performance.
...and 4 more figures
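The dataset assembly described in the Figure 3 caption can be sketched roughly as follows. The three prompt strings are taken from the caption; everything else is an assumption for illustration, including the `build_etd_dataset` and `inference_input` names, the `question + newline + prompt` input layout, the record fields, and the shuffling step.

```python
import random

# Prompt strings as quoted in the Figure 3 caption.
PROMPTS = {
    "cot": "Let's think step by step",
    "pot": "Let's break down the code step by step",
    "eot": "System of linear equations: (Do not simplify)",
}


def inference_input(question, thought):
    """Build the model input that selects one thought form (layout is an assumption)."""
    return f"{question}\n{PROMPTS[thought]}"


def build_etd_dataset(cot, pot, eot, seed=0):
    """Merge three lists of (question, rationale) pairs into one shuffled training set.

    Hypothetical helper: each record keeps the thought type, the prompted input,
    and the rationale the small model should learn to generate.
    """
    samples = []
    for thought, pairs in (("cot", cot), ("pot", pot), ("eot", eot)):
        for question, rationale in pairs:
            samples.append({
                "thought": thought,
                "input": inference_input(question, thought),
                "target": rationale,
            })
    random.Random(seed).shuffle(samples)  # shuffling is an assumption, not from the paper
    return samples


# Toy rationales written for illustration only.
question = "Tom has 3 apples and buys 4 more. How many apples does he have?"
etd = build_etd_dataset(
    cot=[(question, "Tom starts with 3 apples and gains 4, so 3 + 4 = 7. The answer is 7.")],
    pot=[(question, "apples = 3\nbought = 4\nans = apples + bought")],
    eot=[(question, "apples = 3\nbought = 4\nans = apples + bought")],
)
for sample in etd:
    print(sample["thought"], "|", sample["input"].splitlines()[-1])
```

Because the prompt is part of every training input, the same fine-tuned model can be steered at inference time toward equations, programs, or step-by-step chains simply by choosing which prompt to append to the question.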
