Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang

TL;DR

Step-KTO addresses the challenge of trustworthy mathematical reasoning in large language models by jointly supervising intermediate steps and final answers with binary signals. It introduces a Step-KTO objective that blends a Kahneman-Tversky–inspired value function with stepwise and outcome feedback, using a Process Reward Model for steps and an Outcome Reward Model for final solutions. Through iterative training with diverse candidate solutions and explicit stepwise evaluations, Step-KTO yields substantial gains on math benchmarks such as MATH-500, AMC23, and AIME24, including improved Pass@1 scores and reduced stepwise errors (e.g., from 27.3% to under 20%). The results demonstrate that supervising the entire reasoning trajectory enhances both interpretability and reliability of solutions, positioning Step-KTO as a practical path toward more dependable mathematical reasoning in LLMs.

Abstract

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.

Paper Structure

This paper contains 21 sections, 8 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Step-KTO Training Process. Given a dataset of math problems (left), a language model (LLM) produces both reasoning steps and a final answer. Each intermediate reasoning step is evaluated by a process reward model (Process RM), and the final answer is assessed by an outcome reward model (Outcome RM). The binary feedback signals from both levels (outcome-level correctness $c^o$ and stepwise correctness $c^s_h$) are recorded together with the input $x$ and the model's response $y$ (§\ref{subsec:problem_setup}). These signals are then used to compute the Step-KTO loss, guiding the LLM to not only produce correct final answers but also maintain coherent and correct reasoning steps (§\ref{subsec:incorporating_stepwise_feedback}). Through multiple iterations of this training process (§\ref{subsec:training_process}), the model progressively improves both its stepwise reasoning and final answer accuracy.
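
The way the two binary signals in Figure 1 could feed a single loss can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it borrows the general KTO-style form (a sigmoid of the policy-to-reference log-ratio, shifted by a reference point), and every name here (`LabeledResponse`, `kto_term`, `step_kto_loss`, `beta`, `z_ref`, `lam_d`, `lam_u`) is hypothetical. It also assumes that the steps partition the response, so the response-level log-ratio is the sum of the step log-ratios, and that the stepwise terms are averaged; the paper's exact weighting may differ.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledResponse:
    """One sampled solution y for a problem x (field names are illustrative)."""
    # log pi_theta(step_h | x, earlier steps) - log pi_ref(same), one per step
    step_log_ratios: List[float]
    # c^s_h: binary verdict of the process reward model for each step
    step_correct: List[bool]
    # c^o: binary verdict of the outcome reward model for the final answer
    outcome_correct: bool


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def kto_term(log_ratio: float, desirable: bool, z_ref: float = 0.0,
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """KTO-style loss for one binary-labelled unit (assumed form).

    Desirable units are rewarded for a log-ratio above the reference
    point z_ref; undesirable units for a log-ratio below it.
    """
    if desirable:
        return lam_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    return lam_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))


def step_kto_loss(resp: LabeledResponse, **kw: float) -> float:
    """Outcome-level term plus the mean of the stepwise terms (assumed weighting)."""
    # Assumption: the response log-ratio is the sum over its steps.
    outcome = kto_term(sum(resp.step_log_ratios), resp.outcome_correct, **kw)
    if not resp.step_log_ratios:
        return outcome
    steps = [kto_term(r, c, **kw)
             for r, c in zip(resp.step_log_ratios, resp.step_correct)]
    return outcome + sum(steps) / len(steps)


if __name__ == "__main__":
    # A wrong final answer whose first step was judged correct, second step wrong.
    resp = LabeledResponse([2.0, -2.0], [True, False], outcome_correct=False)
    print(step_kto_loss(resp))
```

In this sketch, raising the policy's probability of a step judged correct, or lowering it for a step judged incorrect, reduces the loss even when the final answer is wrong, which is the behaviour the stepwise signal is meant to encourage.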