Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Yen-Ting Lin; Di Jin; Tengyu Xu; Tianhao Wu; Sainbayar Sukhbaatar; Chen Zhu; Yun He; Yun-Nung Chen; Jason Weston; Yuandong Tian; Arash Rahnama; Sinong Wang; Hao Ma; Han Fang

TL;DR

Step-KTO addresses the challenge of trustworthy mathematical reasoning in large language models by jointly supervising intermediate steps and final answers with binary signals. It introduces a Step-KTO objective that blends a Kahneman-Tversky–inspired value function with stepwise and outcome feedback, using a Process Reward Model for steps and an Outcome Reward Model for final solutions. Through iterative training with diverse candidate solutions and explicit stepwise evaluations, Step-KTO yields substantial gains on math benchmarks such as MATH-500, AMC23, and AIME24, including improved Pass@1 scores and reduced stepwise errors (e.g., from 27.3% to under 20%). The results demonstrate that supervising the entire reasoning trajectory enhances both interpretability and reliability of solutions, positioning Step-KTO as a practical path toward more dependable mathematical reasoning in LLMs.
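
The exact Step-KTO objective is not reproduced in this summary, so the following is only a minimal sketch of how a Kahneman-Tversky-style value term could be applied to both outcome-level and step-level binary labels. The per-unit term follows the general shape of the published KTO loss (a sigmoid of a scaled log-probability ratio against a reference point). The way the outcome and step terms are combined (`stepwise_binary_loss`, `step_weight`, averaging over steps), the fixed reference point `z_ref`, and all default values are assumptions made for illustration, not the authors' formulation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kto_term(log_ratio: float, is_desirable: bool, z_ref: float = 0.0,
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """KTO-style loss for one binary-labelled unit.

    log_ratio is log pi_theta(unit | context) - log pi_ref(unit | context).
    z_ref stands in for the reference point; here it is a plain constant
    (an assumption made to keep the sketch self-contained).
    """
    if is_desirable:
        # Loss shrinks as the policy raises the unit's likelihood.
        return lam_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    # Loss shrinks as the policy lowers the unit's likelihood.
    return lam_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))


def stepwise_binary_loss(outcome_log_ratio: float, outcome_correct: bool,
                         step_log_ratios: list, step_correct: list,
                         step_weight: float = 1.0) -> float:
    """Hypothetical combination: outcome term plus the mean of the step terms."""
    outcome = kto_term(outcome_log_ratio, outcome_correct)
    steps = [kto_term(r, c) for r, c in zip(step_log_ratios, step_correct)]
    step_avg = sum(steps) / len(steps) if steps else 0.0
    return outcome + step_weight * step_avg
```

In this sketch a response with a correct final answer but an incorrect intermediate step still incurs a penalty from the step term, which is the behaviour the summary attributes to supervising the whole reasoning trajectory.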

Abstract

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.

Paper Structure (21 sections, 8 equations, 1 figure, 3 tables)

This paper contains 21 sections, 8 equations, 1 figure, 3 tables.

Introduction
Methodology
Problem Setup and Notation
KTO Background
Step-KTO
Iterative Training
Experiments
Task and Datasets
Baseline Methods
Implementation Details
Main Results
Iterative Training
Comparison with Step-DPO
Preference Optimization Variants
Evaluating Reasoning Quality
...and 6 more sections

Figures (1)

Figure 1: Step-KTO Training Process. Given a dataset of math problems (left), a language model (LLM) produces both reasoning steps and a final answer. Each intermediate reasoning step is evaluated by a process reward model (Process RM), and the final answer is assessed by an outcome reward model (Outcome RM). The binary feedback signals from both levels (outcome-level correctness $c^o$ and stepwise correctness $c^s_h$) are recorded together with the input $x$ and the model's response $y$ (see Problem Setup and Notation). These signals are then used to compute the Step-KTO loss, guiding the LLM to not only produce correct final answers but also maintain coherent and correct reasoning steps (see Step-KTO). Through multiple iterations of this training process (see Iterative Training), the model progressively improves both its stepwise reasoning and final answer accuracy.
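
The recording step described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `FeedbackRecord`, `collect_feedback`, and the two reward-model callables are hypothetical names, and the toy reward models at the end are simple stand-ins for real process and outcome reward models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FeedbackRecord:
    """One training example: input x, response y, and binary feedback."""
    x: str              # the math problem
    steps: List[str]    # reasoning steps of the response y
    final_answer: str   # final answer of the response y
    c_o: bool           # outcome-level correctness
    c_s: List[bool]     # stepwise correctness, one flag per step h


def collect_feedback(
    problem: str,
    steps: List[str],
    final_answer: str,
    process_rm: Callable[[str, List[str], str], bool],
    outcome_rm: Callable[[str, str], bool],
) -> FeedbackRecord:
    """Score each step given its prefix, then score the final answer."""
    c_s = [process_rm(problem, steps[:h], steps[h]) for h in range(len(steps))]
    c_o = outcome_rm(problem, final_answer)
    return FeedbackRecord(problem, steps, final_answer, c_o, c_s)


# Toy stand-ins (assumptions): a step counts as correct unless it contains
# the marker "WRONG"; the answer counts as correct if it equals "4".
def toy_process_rm(problem: str, prefix: List[str], step: str) -> bool:
    return "WRONG" not in step


def toy_outcome_rm(problem: str, answer: str) -> bool:
    return answer.strip() == "4"


record = collect_feedback(
    "What is 2 + 2?",
    ["Add the two numbers.", "WRONG: 2 + 2 = 5, then subtract 1."],
    "4",
    toy_process_rm,
    toy_outcome_rm,
)
```

Here `record.c_o` is `True` while `record.c_s` is `[True, False]`: the final answer is right but one step is flagged, which is the kind of case that outcome-only feedback would not distinguish from a fully correct solution.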
