Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
TL;DR
Step-KTO addresses the challenge of trustworthy mathematical reasoning in large language models by jointly supervising intermediate steps and final answers with binary signals. It introduces a Step-KTO objective that blends a Kahneman-Tversky–inspired value function with stepwise and outcome feedback, using a Process Reward Model for steps and an Outcome Reward Model for final solutions. Through iterative training with diverse candidate solutions and explicit stepwise evaluations, Step-KTO yields substantial gains on math benchmarks such as MATH-500, AMC23, and AIME24, including improved Pass@1 scores and reduced stepwise errors (e.g., from 27.3% to under 20%). The results demonstrate that supervising the entire reasoning trajectory enhances both interpretability and reliability of solutions, positioning Step-KTO as a practical path toward more dependable mathematical reasoning in LLMs.
Abstract
Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
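To make the idea of combining outcome-level and step-level binary feedback concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the general KTO form, in which a sigmoid value function is applied to the policy-versus-reference log-probability ratio, and it adds an outcome term to an averaged step term. The function names, the fixed scalar `z_ref` (standing in for the reference point, which KTO estimates from a KL term), the weights `lam_d` and `lam_u`, and `step_weight` are all illustrative assumptions; the text above does not give the exact formula.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(log_ratio: float, desirable: bool, beta: float = 0.1,
             z_ref: float = 0.0, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """KTO-style loss for one unit (a full solution or a single step).

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x) for that unit.
    desirable: binary feedback (True = judged correct).
    z_ref:     reference point; a fixed scalar here (assumption).
    """
    if desirable:
        # Reward raising the likelihood of correct content above the reference point.
        return lam_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    # Reward lowering the likelihood of incorrect content below the reference point.
    return lam_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))


def step_kto_loss(outcome_log_ratio: float, outcome_correct: bool,
                  step_log_ratios: Sequence[float], step_correct: Sequence[bool],
                  step_weight: float = 1.0, **kto_kwargs: float) -> float:
    """Outcome term plus averaged step term (hypothetical combination)."""
    if len(step_log_ratios) != len(step_correct):
        raise ValueError("need one binary label per step")
    outcome_term = kto_loss(outcome_log_ratio, outcome_correct, **kto_kwargs)
    if not step_log_ratios:
        return outcome_term
    step_term = sum(
        kto_loss(r, ok, **kto_kwargs)
        for r, ok in zip(step_log_ratios, step_correct)
    ) / len(step_log_ratios)
    return outcome_term + step_weight * step_term


# Example: a correct final answer whose second step was judged incorrect.
loss = step_kto_loss(2.0, True, [2.0, -2.0], [True, False])
```

In this sketch the outcome label would come from an Outcome Reward Model and the per-step labels from a Process Reward Model, as described above. A solution with a correct final answer still incurs a penalty when the policy assigns high relative likelihood to a step judged incorrect.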
