PersonaMath: Boosting Mathematical Reasoning via Persona-Driven Data Augmentation

Jing Luo, Longze Chen, Run Luo, Liang Zhu, Chang Ao, Jiaming Li, Yukun Chen, Xin Cheng, Wen Yang, Jiayuan Su, Ahmadreza Argha, Hamid Alinejad-Rokny, Chengming Li, Shiwen Ni, Min Yang

TL;DR

The paper tackles the gap between open-source and closed-source models in mathematical reasoning by introducing PersonaMathQA, a persona-driven augmented dataset derived from MATH and GSM8K. It employs a two-stage pipeline: Stage 1 uses a closed-source LLM to generate detailed chain-of-thought (CoT) solutions and rewrites questions across 11 ISCO-08 occupation-based personas to diversify the data; Stage 2 applies reflection to incorrectly answered items to regenerate corrected CoT solutions, placing more emphasis on hard problems. Fine-tuning open-source models on PersonaMathQA yields state-of-the-art results on MATH and GSM8K (e.g., PersonaMath-7B reaches 61.2% and 87.8%, respectively), even though the dataset is smaller than some baselines. The approach demonstrates data efficiency, introduces occupation-based persona classification for data diversity, and publicly releases the dataset, models, and code.
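The two-stage pipeline summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Item` and `Sample` types, the `solve`, `rewrite`, and `reflect` callables (stand-ins for closed-source LLM calls), and the exact-match answer check are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """A seed problem with its ground-truth answer (hypothetical type)."""
    question: str
    reference_answer: str


@dataclass(frozen=True)
class Sample:
    """One training example: question, CoT solution, final answer."""
    question: str
    cot: str
    answer: str


# Hypothetical stand-ins for closed-source LLM calls.
Solver = Callable[[str], Tuple[str, str]]          # question -> (cot, answer)
Rewriter = Callable[[str, str], str]               # (question, persona) -> new question
Reflector = Callable[[str, str], Tuple[str, str]]  # (question, failed cot) -> (cot, answer)


def build_dataset(
    seeds: Sequence[Item],
    personas: Sequence[str],
    solve: Solver,
    rewrite: Rewriter,
    reflect: Reflector,
) -> List[Sample]:
    kept: List[Sample] = []
    failed: List[Tuple[str, str, str]] = []

    # Stage 1: solve the original question and one rewrite per persona.
    for item in seeds:
        variants = [item.question]
        variants += [rewrite(item.question, p) for p in personas]
        for question in variants:
            cot, answer = solve(question)
            # Assumption: correctness is judged by exact match with the reference.
            if answer == item.reference_answer:
                kept.append(Sample(question, cot, answer))
            else:
                failed.append((question, cot, item.reference_answer))

    # Stage 2: reflect on the failures and keep only corrected solutions.
    for question, bad_cot, reference in failed:
        cot, answer = reflect(question, bad_cot)
        if answer == reference:
            kept.append(Sample(question, cot, answer))

    return kept
```

Passing the LLM calls in as plain callables keeps the sketch runnable with toy stubs and makes the Stage 1 / Stage 2 split explicit; a real pipeline would also need prompt templates and more robust answer extraction.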

Abstract

While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models still face challenges with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage focuses on learning from Persona Diversification, and the second stage emphasizes learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a persona-driven data augmentation technique. This technique innovatively classifies personas based on occupations, significantly enhancing the dataset's diversity and quality. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on Qwen2.5-7B) achieves an accuracy of 61.2% on MATH and 87.8% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 128.9K data points (merely 32.6% of MetaMathQA and 49.5% of MathInstruct), yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training. We open-source the PersonaMathQA dataset, PersonaMath models, and our code for public usage.
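As a quick sanity check on the size comparison in the abstract, the baseline dataset sizes can be back-computed from the stated percentages. The resulting figures are derived from the abstract's numbers only and are not quoted from it; the variable names are illustrative.

```python
# Size of PersonaMathQA in thousands of examples, as stated in the abstract.
personamathqa_k = 128.9

# Fraction of each baseline's size that PersonaMathQA represents (from the abstract).
share = {"MetaMathQA": 0.326, "MathInstruct": 0.495}

# Implied baseline sizes: PersonaMathQA size divided by its share.
implied_sizes_k = {name: personamathqa_k / s for name, s in share.items()}

for name, size in implied_sizes_k.items():
    print(f"{name}: about {size:.1f}K examples")
```

The implied sizes come out to roughly 395K examples for MetaMathQA and 260K for MathInstruct, so PersonaMathQA is about a third and a half of those datasets, respectively.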

Paper Structure (32 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The framework of our data augmentation method. The method consists of two stages: Stage 1 (top) and Stage 2 (bottom). Stage 1 focuses on using closed-source LLMs to automatically generate detailed CoT solutions and apply our persona-driven rewriting method to rephrase the questions. Stage 2 focuses on reflection. The data from both stages are then combined to form our PersonaMathQA dataset.
  • Figure 2: The superior performance of our PersonaMath models in comparison to other models. Among all models of the same size, our model achieves the highest test accuracy, demonstrating state-of-the-art performance.
  • Figure 3: Comparison of Word Types and TTR between our PersonaMathQA dataset and MetaMathQA. PersonaMathQA significantly surpasses MetaMathQA in both metrics, demonstrating its superior diversity and quality.
  • Figure 4: Comparison of the distribution of question lengths between our dataset and the two baseline datasets, where "Original" refers to the sum of the MATH and GSM8K datasets. The result shows that the distribution of question lengths in our dataset is more uniform and broader than in the two baseline datasets, indicating superior diversity.
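The two diversity metrics named in Figure 3 can be illustrated with a short sketch. Word types are the distinct words in a corpus, and the type-token ratio (TTR) is the number of word types divided by the total number of word tokens. The tokenization below (lowercasing plus a simple regular expression) is an assumption for illustration; the paper's exact preprocessing may differ.

```python
import re
from typing import Iterable, Tuple


def word_types_and_ttr(texts: Iterable[str]) -> Tuple[int, float]:
    """Return (number of distinct word types, type-token ratio) for a corpus."""
    tokens = [
        token
        for text in texts
        # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
        for token in re.findall(r"[a-z0-9']+", text.lower())
    ]
    types = set(tokens)
    ttr = len(types) / len(tokens) if tokens else 0.0
    return len(types), ttr


questions = [
    "A chef bakes two pies",
    "A nurse counts two pills",
]
n_types, ttr = word_types_and_ttr(questions)
print(n_types, ttr)
```

Raw TTR tends to fall as a corpus grows, so comparisons between datasets of very different sizes are usually made on equal-sized samples; whether the paper does so is not stated in the caption.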