Efficient Mathematical Reasoning Models via Dynamic Pruning and Knowledge Distillation

Fengming Yu, Qingyu Meng, Haiwei Pan, Kejia Zhang

TL;DR

This work tackles the high computational burden of large Transformer-based mathematical reasoning models by introducing a lightweight framework that jointly performs dynamic attention head pruning and recursive knowledge distillation. It computes a head-importance score $S_{l,i} = \alpha \cdot w_{l,i} + (1-\alpha) \cdot H_{l,i}$ to selectively prune attention heads in real time, and uses a recursive distillation objective $\mathcal{L}_{total} = \lambda_{1} \mathcal{L}_{distill} + \lambda_{2} \mathcal{L}_{task}$ with $\mathcal{L}_{distill}$ combining KL divergence and inter-layer attention alignment. Experiments on Math23k and ASDiv-A with ATHENA-base and ATHENA-large show meaningful efficiency gains (e.g., up to ~18–19% parameter reductions and ~21–28% speedups) with minimal accuracy loss (as low as ~0.4–0.7 percentage points), demonstrating robust performance across model sizes and problem complexities. The method offers a practical route toward deploying mathematical reasoning models on resource-constrained devices, and points to future work integrating quantization and broader task settings.
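As a rough illustration of the head-importance score $S_{l,i} = \alpha \cdot w_{l,i} + (1-\alpha) \cdot H_{l,i}$, the sketch below scores the heads of a single layer and drops the lowest-scoring ones. It assumes that $w_{l,i}$ is a weight norm already scaled to $[0, 1]$, that $H_{l,i}$ is the attention entropy normalized by its maximum, and that higher scores mean a head should be kept. These conventions, the function names, and the toy numbers are illustrative assumptions, not details taken from the paper.

```python
import math


def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in attn_row if p > 0.0)


def head_scores(weight_norms, entropies, alpha=0.5):
    """S_i = alpha * w_i + (1 - alpha) * H_i for the heads of one layer."""
    return [alpha * w + (1.0 - alpha) * h
            for w, h in zip(weight_norms, entropies)]


def keep_mask(scores, prune_ratio):
    """Mark heads to keep; the floor(prune_ratio * n) lowest scores are dropped."""
    n_prune = int(prune_ratio * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(scores))]


# Toy layer with four heads; every number here is made up for illustration.
weight_norms = [0.9, 0.2, 0.6, 0.4]      # assumed already scaled to [0, 1]
attn_rows = [
    [0.25, 0.25, 0.25, 0.25],            # head 0: uniform attention
    [0.97, 0.01, 0.01, 0.01],            # head 1: sharply peaked attention
    [0.40, 0.30, 0.20, 0.10],            # head 2
    [0.70, 0.10, 0.10, 0.10],            # head 3
]
max_entropy = math.log(len(attn_rows[0]))  # entropy of a uniform distribution
entropies = [attention_entropy(row) / max_entropy for row in attn_rows]

scores = head_scores(weight_norms, entropies, alpha=0.5)
mask = keep_mask(scores, prune_ratio=0.3)
print(scores)
print(mask)
```

With these toy inputs, head 1 has both the smallest weight norm and the lowest entropy, so it is the single head removed at a 30% ratio over four heads.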

Abstract

With the rapid development of deep learning, large language models have shown strong capabilities in complex reasoning tasks such as mathematical equation solving. However, their substantial computational and storage costs hinder practical deployment. This paper proposes a lightweight optimization method that integrates dynamic attention head pruning with knowledge distillation. The approach dynamically evaluates the importance of each attention head in the multi-head attention mechanism using a combination of weight norms and entropy, and prunes redundant heads in real time to reduce computational overhead. To mitigate performance degradation, knowledge distillation transfers information from the original model to the pruned student, enabling the smaller model to preserve reasoning ability. Experiments on Math23k and ASDiv-A verify the effectiveness of the proposed method. For example, on Math23k with a 30% pruning ratio, parameters are reduced by 18.7%, inference speed is improved by 27.5%, FLOPs are reduced by 19.3%, and accuracy drops by only 0.7 percentage points (from 84.4% to 83.7%). These results demonstrate that the method achieves substantial efficiency gains while maintaining strong reasoning performance, providing a practical solution for efficient deployment of large language models in mathematical reasoning tasks.
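The distillation step can be sketched with the objective $\mathcal{L}_{total} = \lambda_{1} \mathcal{L}_{distill} + \lambda_{2} \mathcal{L}_{task}$, where $\mathcal{L}_{distill}$ combines KL divergence with attention alignment. The exact combination is not given in this summary, so the sketch below assumes a plain sum of a temperature-softened KL term and a mean-squared attention alignment term weighted by a hypothetical factor `beta`. The default weights and the temperature are placeholders as well.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def attention_mse(teacher_attn, student_attn):
    """Mean squared difference between two flattened attention maps."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_attn, student_attn)]
    return sum(diffs) / len(diffs)


def total_loss(teacher_logits, student_logits, teacher_attn, student_attn,
               task_loss, lambda1=0.5, lambda2=0.5, beta=1.0, temperature=2.0):
    """L_total = lambda1 * L_distill + lambda2 * L_task.

    Assumption: L_distill = KL(teacher || student) + beta * attention MSE.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    distill = (kl_divergence(p_teacher, p_student)
               + beta * attention_mse(teacher_attn, student_attn))
    return lambda1 * distill + lambda2 * task_loss


# Toy values: a student that matches the teacher exactly pays only the task loss.
teacher_logits = [2.0, 0.5, -1.0]
teacher_attn = [0.6, 0.3, 0.1]
matched = total_loss(teacher_logits, teacher_logits,
                     teacher_attn, teacher_attn, task_loss=2.0)
drifted = total_loss(teacher_logits, [0.0, 1.5, 0.5],
                     teacher_attn, [0.2, 0.2, 0.6], task_loss=2.0)
print(matched, drifted)
```

A student whose logits and attention maps match the teacher contributes zero distillation loss, so only the weighted task loss remains. Any mismatch increases the total.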

Paper Structure

This paper contains 20 sections, 7 equations, 1 figure, 4 tables, 2 algorithms.

Figures (1)

  • Figure 1: Overview of the pruning and distillation framework. (a) The teacher model performs inference on the dataset to calculate the importance score of each attention head, generating an importance score matrix; (b) dynamic pruning based on the importance score matrix; (c) the pruned model serves as the student and is distilled from the unpruned teacher model; (d) recursive knowledge distillation: each student serves as the teacher for the next stage until the target compression ratio is reached.
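
The recursive stage in panel (d) can be pictured as a loop in which each distilled student becomes the next teacher. The sketch below is only a control-flow outline under stated assumptions: `prune_and_distill` is a hypothetical user-supplied callable standing in for panels (a) to (c), the per-stage ratio is applied to the heads still remaining, and the head counts are invented for the example.

```python
def recursive_distillation(teacher, n_heads, stage_ratio, target_ratio,
                           prune_and_distill):
    """Prune and distill in stages until the target head count is reached.

    prune_and_distill(teacher, heads_to_keep) is a hypothetical callable that
    returns a student with `heads_to_keep` heads, distilled from `teacher`.
    """
    target_heads = n_heads - int(target_ratio * n_heads)
    head_counts = [n_heads]
    current = n_heads
    while current > target_heads:
        # Drop at least one head per stage, but never go below the target.
        n_drop = max(1, int(stage_ratio * current))
        current = max(target_heads, current - n_drop)
        student = prune_and_distill(teacher, current)
        teacher = student  # the student is the teacher for the next stage
        head_counts.append(current)
    return teacher, head_counts


def fake_prune_and_distill(teacher, heads_to_keep):
    """Stand-in that only records the lineage; no real training happens."""
    return {"heads": heads_to_keep, "parent": teacher}


# Hypothetical model with 144 heads, pruned 10% per stage toward 30% overall.
final_model, head_counts = recursive_distillation(
    teacher={"heads": 144, "parent": None},
    n_heads=144,
    stage_ratio=0.1,
    target_ratio=0.3,
    prune_and_distill=fake_prune_and_distill,
)
print(head_counts)
```

Splitting the compression into several small stages means each student is distilled from a teacher that is only slightly larger than itself, which is the intuition behind the recursive scheme in the caption.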