More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
Yike Zhao, Simin Guo, Ziqing Yang, Shifan Han, Dahua Lin, Fei Tan
TL;DR
The paper asks whether increasing data volume alone improves mathematical reasoning in industrial LLM pipelines, and proposes a unified evaluation framework that mirrors training and deployment. Through a systematic analysis of open-source datasets and data synthesis methods across pretraining and SFT, it shows that better data, especially structured, interpretable formats and distillation from stronger models, outperforms volume alone, with techniques such as NaturalReasoning-QwQ and Math-Cosmo delivering notable gains. It also finds that noisy web data and indiscriminate data mixing can degrade performance, while RL-inspired and curated strategies hold promise for scalable improvements. The work offers actionable guidelines for cost-effective data curation and practical model enhancement, bridging research and real-world deployment in mathematical reasoning, and highlights future directions including RL-like data synthesis and refined curriculum designs.
Abstract
The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Although various data construction methods have been proposed, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.
