Exploring the Mystery of Influential Data for Mathematical Reasoning

Xinzhe Ni; Yeyun Gong; Zhibin Gou; Yelong Shen; Yujiu Yang; Nan Duan; Weizhu Chen

TL;DR

This work addresses how to select influential data for fine-tuning large language models on mathematical reasoning tasks. It introduces QaDS, a Quality-aware Diverse Selection strategy that combines diversity via K-center Greedy with a data-quality score derived from one-shot influence measurements and a lightweight quality scorer. The authors construct OpenMathMix, a mixture of open-source data selected by QaDS, and achieve a state-of-the-art 48.8% accuracy on the MATH benchmark with a 7B base model. They also analyze data composition, showing that scaling up reasoning data helps and that general data selected by QaDS can improve reasoning performance, which offers guidance for future open datasets.

Abstract

Selecting influential data for fine-tuning on downstream tasks is a key factor for both performance and computational efficiency. Recent works have shown that training with only limited data can yield superior performance on general tasks. However, the feasibility of this approach on mathematical reasoning tasks has not been validated. Going further, two questions remain open for mathematical reasoning: how to select influential data, and what constitutes an influential data composition. For the former, we propose a Quality-aware Diverse Selection (QaDS) strategy adapted to mathematical reasoning. A comparison with other selection strategies validates the superiority of QaDS. For the latter, we first enlarge our setting and explore the influential data composition. We conduct a series of experiments and highlight two findings: scaling up reasoning data is helpful, and so is training with general data selected by QaDS. We then define our optimal mixture as OpenMathMix, an influential data mixture of open-source data selected by QaDS. With OpenMathMix, we achieve a state-of-the-art 48.8% accuracy on MATH with a 7B base model. Additionally, we showcase the use of QaDS in creating efficient fine-tuning mixtures with various selection ratios, and analyze the quality of a wide range of open-source datasets, which can serve as a reference for future work on mathematical reasoning tasks.
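The TL;DR names K-center Greedy as the diversity component of QaDS. As a rough illustration of that step only, the sketch below implements the generic K-center Greedy heuristic over a small array of embeddings. The function name `k_center_greedy`, the `seed_index` parameter, the Euclidean distance, and the toy `points` array are all assumptions made for this example. How QaDS combines the result with its quality scores is not described in this summary, so that part is left out.

```python
import numpy as np


def k_center_greedy(embeddings, k, seed_index=0):
    """Greedily pick k indices that spread out over the embedding space.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    n = embeddings.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of points")
    selected = [seed_index]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(embeddings - embeddings[seed_index], axis=1)
    while len(selected) < k:
        # The next center is the point farthest from all current centers.
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        dist_to_new = np.linalg.norm(embeddings - embeddings[next_index], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


# Toy data (assumed): two tight pairs and one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
chosen = k_center_greedy(points, k=3)
```

With this toy input the selection takes one point from each of the three regions rather than both members of a tight pair, which is the behavior a diversity-oriented selector is meant to have.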

Paper Structure (18 sections, 5 equations, 7 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Data Selection Strategy
Diversity Aspect
Quality Aspect
Quality-aware Diverse Selection (QaDS)
Experimental Setup
Experimental Results
Main Results on $\boldsymbol{S_{base}}$
Main Results on $\boldsymbol{S_{large}}$
Analysis
Conclusion
Implementation Details with Quality Aspect
Selection Scheme of $S_{base}$
Composition of Data in $S_{base}$
...and 3 more sections

Figures (7)

Figure 1: An overview of our proposed QaDS, including three parts from left to right. Left: A pipeline with diversity aspect. Middle: A pipeline with quality aspect. Right: Combining diversity aspect and quality aspect, QaDS selects influential data for fine-tuning.
Figure 2: Average accuracy in $S_{base-69K}$, $S_{base-130K}$ and $S_{base-203K}$ with LLaMA-2. The dark gray dashed line: performance with all 428K mathematical reasoning and general data. The light gray dashed line: performance with all 153K mathematical reasoning data.
Figure 3: Accuracy of 0.5M, 1.8M, 3.3M and 4.7M data of OpenMathMix with selection ratios of 10%, 40%, 70% and 100% by QaDS.
Figure 4: CoT accuracy comparison with 7B models on MATH in $S_{large}$. The bold font indicates the best result.
Figure 4: Pearson analysis between real quality scores and scorer quality scores.
...and 2 more figures
