
A Diversity-Enhanced Knowledge Distillation Model for Practical Math Word Problem Solving

Yi Zhang, Guangyou Zhou, Zhiwen Xie, Jinjin Ma, Jimmy Xiangji Huang

TL;DR

This work tackles the limited diversity of solution equations in math word problem solvers by introducing DivKD, a diversity-enhanced knowledge distillation framework. It combines an adaptive, beam-search-guided knowledge distillation (AdaKD) with a diversity-prior student model built around a conditional variational autoencoder (CVAE) to capture the distribution of possible correct equations. The method demonstrates improved answer accuracy across four benchmarks (Math23K, MAWPS, MathQA, SVAMP) while maintaining efficiency comparable to single-decoder baselines, outperforming prior KD approaches that rely on multiple decoders. The results underscore DivKD’s practical value for producing diverse, high-quality solutions in real-world MWP applications, with potential extensions toward model compression and broader dataset evaluation.

Abstract

Math Word Problem (MWP) solving, a critical task in natural language processing, has garnered significant research interest in recent years. Many recent studies rely heavily on Seq2Seq models and their extensions (e.g., Seq2Tree and Graph2Tree) to generate mathematical equations. While effective, these models struggle to generate diverse yet equivalent solution equations, which limits their generalization across math problem scenarios. In this paper, we introduce a novel Diversity-enhanced Knowledge Distillation (DivKD) model for practical MWP solving. Our approach proposes an adaptive diversity distillation method, in which a student model learns diverse equations by selectively transferring high-quality knowledge from a teacher model. Additionally, we design a diversity prior-enhanced student model to better capture the diversity distribution of equations by incorporating a conditional variational auto-encoder. Extensive experiments on four MWP benchmark datasets demonstrate that our approach achieves higher answer accuracy than strong baselines while maintaining high efficiency for practical applications.
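
To make "selectively transferring high-quality knowledge" concrete, here is a minimal sketch of one plausible selection step: keep only the teacher candidates whose equations evaluate to the gold answer. It is an illustration under stated assumptions, not the paper's actual procedure. The function names `evaluate` and `select_correct_candidates`, the tolerance `tol`, and the plain-string equation format are all hypothetical.

```python
import ast
import operator

# Binary operators accepted by the tiny arithmetic evaluator (assumption:
# candidate equations use only +, -, *, / and numeric literals).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate an arithmetic expression string without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def select_correct_candidates(candidates, gold_answer, tol=1e-4):
    """Return the distinct candidates whose value matches gold_answer.

    Hypothetical stand-in for filtering a teacher's beam-search outputs:
    candidates that fail to parse, divide by zero, or give a wrong answer
    are treated as noise and dropped.
    """
    kept = []
    for expr in candidates:
        try:
            value = evaluate(expr)
        except (ValueError, SyntaxError, ZeroDivisionError):
            continue
        if abs(value - gold_answer) <= tol and expr not in kept:
            kept.append(expr)
    return kept


if __name__ == "__main__":
    beam = ["3*(2+4)", "3*2+3*4", "3*2+4", "3*(2+", "18/0"]
    print(select_correct_candidates(beam, gold_answer=18.0))
```

In this toy beam, two different but equivalent equations survive the filter, and the wrong, malformed, and undefined candidates are discarded. A student trained on the surviving set would see more than one valid equation per problem.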

Paper Structure

This paper contains 24 sections, 15 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: An example of diverse equations and noisy soft labels generated by the teacher model Graph2Tree-Z (Zhang et al., 2020), in which the symbols "$\times$" and "$\checkmark$" mark incorrect and correct answers, respectively.
  • Figure 2: The overview of the proposed DivKD model.
  • Figure 3: Comparison of testing time between baselines and our proposed methods (e.g., GTS+DivKD and Ro-Graph2TreeZ+DivKD) on the Math23K and MAWPS datasets.
  • Figure 4: A quantitative analysis of the correct expressions generated by base models (e.g., Ro-GTS) and student models (e.g., Ro-GTS+DivKD) on the Math23K and MathQA datasets.
  • Figure 5: Two examples of the generated results on Math23K using GTS and the proposed GTS+DivKD.