Skill-Targeted Adaptive Training

Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora

TL;DR

Skill-Targeted Adaptive Training (STAT) addresses saturation in supervised fine-tuning (SFT) for math tasks by using a frontier LLM as a teacher that identifies task-specific skills and maintains a Missing-Skill-Profile for the student. The method proceeds in three stages: (1) detect difficult questions with reward filtering, (2) infer missing skills via a frontier teacher, and (3) construct a targeted training set by reweighting existing data or synthesizing new data aligned with the identified skills. Experiments on Llama and Qwen models show gains of up to 7.5% on MATH, where vanilla SFT provides only limited gains, and an average gain of 4.6% on out-of-distribution (OOD) benchmarks such as AIME24/25 and AMC23. STAT also complements RL-based approaches such as GRPO, which adds further gains on top of a STAT-trained model. The findings indicate that explicitly targeting core skill gaps, especially in basic algebra and computation, yields improvements that generalize beyond the source data and supports continual adaptation to new evaluation settings. The work provides a practical data-construction protocol and releases code to facilitate reproducibility and extension to other domains.
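
Stage 1 of the pipeline can be illustrated with a minimal sketch. It assumes each student response has already been given a scalar reward in [0, 1]; the reward source, the function name `find_difficult_questions`, and the 0.5 threshold are hypothetical choices for illustration, not details taken from the paper.

```python
from statistics import mean


def find_difficult_questions(rewards_by_question, threshold=0.5):
    """Return ids of questions whose mean reward falls below `threshold`.

    `rewards_by_question` maps a question id to the rewards of the student's
    sampled responses. The threshold is an assumed value, not the paper's.
    """
    return sorted(
        qid
        for qid, rewards in rewards_by_question.items()
        if rewards and mean(rewards) < threshold
    )


# Toy rewards for three questions, four sampled responses each.
rewards = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # consistently solved
    "q2": [0.0, 1.0, 0.0, 0.0],  # mostly failed
    "q3": [0.0, 0.0, 0.0, 0.0],  # always failed
}
difficult = find_difficult_questions(rewards)
```

Only the questions flagged here would be passed to the teacher in Stage 2, which keeps the more expensive frontier-LLM analysis focused on responses where the student actually struggles.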

Abstract

Language models often show little to no improvement (i.e., "saturation") when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often the student failed to apply each skill in its responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25 and AMC23) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines. Our code is available at: https://github.com/princeton-pli/STAT.
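
The Missing-Skill-Profile and the STAT-Sel reweighting described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the released implementation: the function names, the skill labels, and the rule of weighting each training question by the summed profile mass of its skills are all hypothetical.

```python
from collections import Counter


def missing_skill_profile(missing_skills_per_response):
    """Normalize how often each skill was flagged as missing by the teacher."""
    counts = Counter(
        skill for skills in missing_skills_per_response for skill in skills
    )
    total = sum(counts.values())
    return {skill: c / total for skill, c in counts.items()} if total else {}


def question_weights(profile, skills_by_question):
    """Weight each training question by the profile mass of its skills.

    Assumed rule: raw weight = sum of profile values over the question's
    skills, then normalized so that all weights sum to 1.
    """
    raw = {
        qid: sum(profile.get(skill, 0.0) for skill in skills)
        for qid, skills in skills_by_question.items()
    }
    total = sum(raw.values())
    return {qid: (w / total if total else 0.0) for qid, w in raw.items()}


# Hypothetical teacher output: missing skills for three failed responses.
teacher_labels = [["basic_algebra", "arithmetic"], ["basic_algebra"], ["geometry"]]
profile = missing_skill_profile(teacher_labels)

# Hypothetical skill labels for three candidate training questions.
train_skills = {
    "t1": ["basic_algebra"],
    "t2": ["geometry", "arithmetic"],
    "t3": ["probability"],
}
weights = question_weights(profile, train_skills)
```

The resulting weights could then drive sampling of the fine-tuning set, for example with `random.choices`, so that questions exercising frequently missed skills appear more often while questions on skills the student already applies reliably are downweighted.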

Paper Structure

This paper contains 42 sections, 6 equations, 7 figures, 13 tables, 1 algorithm.

Figures (7)

  • Figure 1: STAT is a three-stage skill-based data selection/generation method for supervised fine-tuning (SFT). Stage 1: Identify difficult questions for each model using reward filtering on model responses. Stage 2: Use frontier LLMs to analyze the model responses and build a model-specific Missing-Skill-Profile. Stage 3: Use a pre-constructed skill map to convert the missing-skill distribution into a training-question distribution, which constitutes the STAT-Sel data. STAT-Syn synthesizes new questions using frontier LLMs targeted to the missing skills.
  • Figure 2: Comparison among the Top 10 frequent skills present in , , and questions selected on . The skills emphasized in both baselines, and , align poorly with the actual Top 10 missing skills of the model (i.e., skills in ). Furthermore, the missing skills are not necessarily those most common in the original data distribution, as shown by the skill distribution of .
  • Figure 3: Continual learning results on MATH-perturb-hard. Further fine-tuning STAT models based on their missing skills on unseen data yields a 3–4% gain (/ConSyn).
  • Figure 4: Trained model performance (Left) and performance gain over the base model (Right) on the Top 10 most frequent missing skills, across training strategies on . Accuracies in the left plot are normalized per skill axis for better visualization. Our approaches, STAT-Sel and STAT-Syn, are the most effective at improving model performance across nearly all the skills.
  • Figure 5: Comparison between synthesized questions from and .
  • ...and 2 more figures