AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners
Woosung Koh, Wonbeen Oh, Jaein Jang, MinHyung Lee, Hyeongjin Kim, Ah Yeon Kim, Joonkee Kim, Junghyun Lee, Taehyeon Kim, Se-Young Yun
TL;DR
AdaSTaR addresses inefficiencies in Self-Taught Reasoner (STaR) training by introducing two adaptive sampling mechanisms: Adaptive Sampling for Diversity (AdaD) and Adaptive Sampling for Curriculum (AdaC). AdaD uses a Diversity Statistic with a Hierarchical MinHeap to prioritize under-sampled and harder observations, while AdaC modulates sampling based on the current model strength $\alpha$, mixing in easier data when the model is weak. Across six reasoning benchmarks and multiple base models, AdaSTaR achieves the highest test accuracy on all six while reducing training FLOPs by an average of 58.6%, and ablation studies isolate the contributions of the diversity and curriculum components. These results demonstrate a practical, low-overhead approach to more efficient self-improvement in reasoning LMs, one that generalizes to larger models and to different model families.
Abstract
Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance: the model inefficiently over-trains on solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves the best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
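The two sampling principles can be illustrated with a minimal Python sketch. It is not the authors' implementation: the priority key `(last_sampled_iter, win_rate)`, the rule that only the first `int(alpha ** 2 * len(batch))` sampled observations have their statistics refreshed, and the names `sample_batch`, `update_stats`, `stats`, and `results` are all assumptions made for illustration. A flat `heapq` stands in for the Hierarchical MinHeap.

```python
import heapq


def sample_batch(stats, batch_size):
    """Diversity step (assumed form): pick observations with the smallest keys.

    stats maps obs_id -> (last_sampled_iter, win_rate). Tuples compare
    lexicographically, so the least recently sampled observations come first,
    and ties are broken in favor of the lower win rate (harder observation).
    """
    heap = [(last, win, obs_id) for obs_id, (last, win) in stats.items()]
    heapq.heapify(heap)
    n = min(batch_size, len(heap))
    return [heapq.heappop(heap)[2] for _ in range(n)]


def update_stats(stats, batch, results, iteration, alpha):
    """Curriculum step (assumed form): refresh only a leading part of the batch.

    alpha is the current model strength in [0, 1], e.g. training accuracy.
    Observations whose statistics are not refreshed keep their old key and
    therefore stay near the front of the queue for the next iteration.
    """
    n_update = int(alpha ** 2 * len(batch))  # assumed update rule
    for obs_id in batch[:n_update]:
        stats[obs_id] = (iteration, results[obs_id])
    return n_update


# Hypothetical toy data: four observations with (last_sampled_iter, win_rate).
stats = {"a": (0, 0.9), "b": (0, 0.1), "c": (1, 0.5), "d": (0, 0.5)}
batch = sample_batch(stats, 2)
n_updated = update_stats(stats, batch, {"b": 0.0, "d": 1.0}, iteration=2, alpha=0.8)
next_batch = sample_batch(stats, 2)
```

Under these assumptions, a weaker model (smaller `alpha`) refreshes fewer entries per iteration, so the sampler moves through the data more slowly; a stronger model refreshes more and moves on sooner to observations it has not seen recently.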
