Guiding Through Complexity: What Makes Good Supervision for Hard Math Reasoning Tasks?

Xuan He, Da Yin, Nanyun Peng

TL;DR

The paper tackles how to supervise LLMs on hard math reasoning tasks when teachers are weak, formalizing a comparison between hard full-task supervision ($\mathcal{D}_{\mathrm{Hard}}$) and easy-subtask supervision ($\mathcal{D}_{\mathrm{Subtask}}$) under controlled outcome error rates (ERs) $\epsilon_{\mathrm{Hard}}$ and $\epsilon_{\mathrm{Subtask}}$. Through a two-stage data-synthesis pipeline and supervised fine-tuning across five hard benchmarks ($\text{MATH}$, Olympic-Arena math, $\text{JEE-Bench}$, $\text{Gaokao-Mathcloze}$, and $\text{SAT-Math}$), the study finds that hard full-task supervision consistently outperforms subtask supervision even at high $\epsilon_{\mathrm{Hard}}$, and that the step-wise error rate (SWER) of the supervision strongly correlates with downstream performance. The authors further show that supplementing hard-task supervision with the corresponding subtasks yields notable gains, with the best results arising from carefully chosen combinations of hard and subtask ERs (e.g., $\epsilon_{\mathrm{Hard}}=50\%$, $\epsilon_{\mathrm{Subtask}}=10\%$ on several benchmarks). These insights offer practical guidance for data collection and augmentation in LLM alignment: step-level solution quality matters more than final-output accuracy, and combining hard and easy supervision can push reasoning performance beyond single-source approaches. The released data and code support reproducibility, and the authors leave broader domains and additional alignment methods to future work.

Abstract

How can "weak teacher models", such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially those that challenge and require expertise or daily practice from the teacher models? In this paper, we seek empirical answers to this question by investigating various data-driven strategies that offer supervision data at different quality levels on tasks of varying complexity. Two intuitive strategies emerge for teacher models to provide supervision during alignment training: 1) using lower-quality supervision from complete tasks that match the difficulty of the target reasoning tasks, and 2) leveraging higher-quality supervision from easier subtasks that are less challenging. Interestingly, we find that even when the outcome error rate for hard task supervision is high (e.g., 90%), training on such data can outperform perfectly correct supervision of easier subtasks on multiple hard math benchmarks. We further identify a more critical factor influencing training performance: step-wise error rates, which indicate the severity of errors in solutions. Specifically, training on hard task supervision with the same outcome error rates but disparate step-wise error rates can lead to a 30% accuracy gap on the MATH benchmark. Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield more notable performance improvements than simply adding rephrased hard full-task supervision, suggesting new avenues for data augmentation. Data and code are released at https://github.com/hexuan21/Weak-to-Strong.
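
To make the distinction between the two error measures concrete, the sketch below computes an outcome error rate and a step-wise error rate for two toy supervision sets. It is illustrative only: the `Solution` class, both function names, and the toy data are hypothetical, and the step-wise error rate is assumed here to be the fraction of incorrect steps among all steps, which may differ from the exact definition used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    # One flag per reasoning step: True if the step is correct (hypothetical labels).
    step_correct: List[bool]
    # Whether the final answer matches the reference answer.
    final_correct: bool


def outcome_error_rate(solutions: List[Solution]) -> float:
    """Fraction of solutions whose final answer is wrong."""
    return sum(not s.final_correct for s in solutions) / len(solutions)


def step_wise_error_rate(solutions: List[Solution]) -> float:
    """Fraction of incorrect steps over all steps (assumed definition)."""
    total_steps = sum(len(s.step_correct) for s in solutions)
    wrong_steps = sum(s.step_correct.count(False) for s in solutions)
    return wrong_steps / total_steps


# Two toy sets: same share of wrong final answers, different error severity.
mild = [
    Solution([True, True, True, False], final_correct=False),
    Solution([True, True, True, True], final_correct=True),
]
severe = [
    Solution([False, False, False, False], final_correct=False),
    Solution([True, True, True, True], final_correct=True),
]

print(outcome_error_rate(mild), step_wise_error_rate(mild))      # 0.5 0.125
print(outcome_error_rate(severe), step_wise_error_rate(severe))  # 0.5 0.5
```

Both toy sets have the same outcome error rate, yet the second contains four times as many faulty steps. The abstract identifies this kind of difference as the more critical factor for training performance.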

Paper Structure

This paper contains 37 sections, 3 figures, 43 tables.

Figures (3)

  • Figure 1: Overview of our empirical study on two contrasting supervision strategies and further analysis.
  • Figure 2: Comparison of easy and hard task supervision with varying outcome ER on 5 hard reasoning benchmarks.
  • Figure 3: Performance of the supervision synthesized by different teacher models under similar outcome error rates (ER).