AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

Woosung Koh, Wonbeen Oh, Jaein Jang, MinHyung Lee, Hyeongjin Kim, Ah Yeon Kim, Joonkee Kim, Junghyun Lee, Taehyeon Kim, Se-Young Yun

TL;DR

AdaSTaR addresses inefficiencies in Self-Taught Reasoner (STaR) training by introducing two adaptive sampling mechanisms: Adaptive Sampling for Diversity (AdaD) and Adaptive Sampling for Curriculum (AdaC). AdaD uses a Diversity Statistic with a Hierarchical MinHeap to prioritize under-sampled and harder observations, while AdaC modulates sampling based on the current model strength $\alpha$ to mix in easier data when needed. Across six reasoning benchmarks and multiple base models, AdaSTaR achieves the highest test accuracy while reducing training FLOPs by an average of 58.6%, and ablation studies isolate the contributions of the diversity and curriculum components. These results demonstrate a practical, low-overhead approach to more efficient self-improvement in reasoning LMs that generalizes to larger models and other model families.
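The two mechanisms described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `adaptive_sample`, the two-level heap key (last-sampled iteration, then win rate), and the rule that only a fraction `alpha` of each batch has its priority refreshed are all assumptions made for illustration.

```python
import heapq


def adaptive_sample(last_sampled, win_rate, batch_size, alpha, iteration):
    """Pick observation ids, least-recently-sampled and hardest first.

    last_sampled: dict id -> iteration at which the id last had its priority refreshed
    win_rate:     dict id -> fraction of correct generations at the last visit (0..1)
    alpha:        current model strength in [0, 1] (for example, training accuracy)
    All names and the refresh rule below are hypothetical.
    """
    # Two-level (hierarchical) key: older last-sampled iteration first (diversity),
    # then lower win rate first (difficulty); the id breaks ties deterministically.
    heap = [(last_sampled[i], win_rate[i], i) for i in last_sampled]
    heapq.heapify(heap)
    batch = [heapq.heappop(heap)[2] for _ in range(min(batch_size, len(heap)))]

    # Curriculum assumption: only the first `alpha` fraction of the batch is
    # marked as freshly sampled. With a weak model, the remaining (relatively
    # easier) items keep their old priority and are revisited sooner.
    n_refresh = int(len(batch) * alpha)
    for i in batch[:n_refresh]:
        last_sampled[i] = iteration
    return batch


last_sampled = {0: 0, 1: 0, 2: 0, 3: 0}
win_rate = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.3}
first = adaptive_sample(last_sampled, win_rate, batch_size=2, alpha=0.5, iteration=1)
second = adaptive_sample(last_sampled, win_rate, batch_size=2, alpha=1.0, iteration=2)
```

In this toy run the hardest item (id 1) is drawn first and refreshed, while id 3 keeps its old priority because `alpha` is only 0.5, so it is drawn again in the next call.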

Abstract

Self-Taught Reasoners (STaR), also known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance: the model is inefficiently over-trained on solved examples and under-trained on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves the best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
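The imbalance the abstract describes can be reproduced with a small toy simulation. This is a sketch under strong simplifying assumptions, not the paper's experiment: each observation has a fixed, hypothetical probability `solve_prob[i]` of yielding a correct generation, the model never improves, and an observation counts as "trained on" only when its sampled generation passes the rejection filter.

```python
import random


def simulate_star_counts(solve_prob, iterations, batch_size, seed=0):
    """Count how often each observation ends up in the fine-tuning set.

    solve_prob: list of per-observation success probabilities (hypothetical values)
    Returns a list with one training count per observation.
    """
    rng = random.Random(seed)
    trained = [0] * len(solve_prob)
    ids = list(range(len(solve_prob)))
    for _ in range(iterations):
        # Uniform random observation sampling, as in the STaR baseline.
        for i in rng.sample(ids, batch_size):
            # Rejection filter: keep the observation only if the answer is correct.
            if rng.random() < solve_prob[i]:
                trained[i] += 1
    return trained


# 50 easy observations (90% solved) and 50 hard ones (10% solved).
probs = [0.9] * 50 + [0.1] * 50
counts = simulate_star_counts(probs, iterations=200, batch_size=20)
easy_total, hard_total = sum(counts[:50]), sum(counts[50:])
```

Although every observation is sampled equally often in expectation, the easy half accumulates far more training updates than the hard half, which is the imbalance adaptive sampling is meant to correct.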

Paper Structure

This paper contains 47 sections, 3 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Average test accuracy and FLOPs across six datasets for Llama 3.2 3B and three datasets for Qwen 2.5 3B. Results consistently extend to Gemma 7B as well. *We use outcome verification on B-STaR for fair comparison. Thus, the implementation with process verification may perform significantly better.
  • Figure 2: High-level schematic diagram of AdaSTaR. Other STaR-like approaches are equivalent to this diagram, excluding the win statistic $w_i$ computation and the Adaptive Sampling module.
  • Figure 3: Empirical motivation for the need for adaptive sampling of diverse observations (a), regularized with curriculum learning (b).
  • Figure 4: AdaSTaR
  • Figure 5: Visualizing the entire learning curve for SVAMP on Llama 3.2 3B (left), Qwen 2.5 3B (center), and Gemma 7B (right). Each method's curve is charted up to its best (early-stopped) iteration. The highest test accuracy is marked as a star, and second best as a diamond. As some methods converge only after a significant amount of PFLOPs, for legibility of shorter curves, we use dashed lines, and annotate the precise PFLOPs cost on the chart.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Remark 1: Non-excessive sampling in line 7