Correct Reasoning Paths Visit Shared Decision Pivots
Dongkyu Cho, Amy B. Z. Zhang, Bilel Fehri, Sheng Wang, Rumi Chunara, Rui Song, Hengrui Cai
TL;DR
This work tackles the challenge of verifying and training chain-of-thought explanations by introducing decision pivots—minimal, verifiable checkpoints that correct reasoning paths must visit. It proposes ROMA, a three-stage self-training framework that (i) bootstraps diverse reasoning paths to mine shared pivots, (ii) compresses them into pivot-focused short-path reasoning using a fine-tuned verifier, and (iii) post-trains the model with pairwise preference optimization (DPO) to emphasize concise, pivot-rich explanations. Across LogiQA, MedQA, and MATH500, pivot-centric SPR with a domain-tuned verifier yields consistent improvements in both task accuracy and reasoning quality, outperforming metric-based filtering and naïve self-training. The method reduces reasoning length and provides a scalable, domain-adaptable approach to aligning reasoning without requiring ground-truth rationales. Overall, the pivot-based framework offers a practical path to more faithful, verifiable reasoning in LLMs with broad applicability across domains.
Abstract
Chain-of-thought (CoT) reasoning exposes the intermediate thinking process of large language models (LLMs), yet verifying those traces at scale remains unsolved. In response, we introduce the idea of decision pivots: minimal, verifiable checkpoints that any correct reasoning path must visit. We hypothesize that correct reasoning paths, though stylistically diverse, converge on the same pivot set, while incorrect ones violate at least one pivot. Leveraging this property, we propose a self-training pipeline that (i) samples diverse reasoning paths and mines shared decision pivots, (ii) compresses each trace into pivot-focused short-path reasoning using an auxiliary verifier, and (iii) post-trains the model using its self-generated outputs. The proposed method aligns reasoning without ground-truth reasoning data or external metrics. Experiments on standard benchmarks such as LogiQA, MedQA, and MATH500 show the effectiveness of our method.
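
The pivot hypothesis above can be illustrated with a toy sketch. It is not the paper's implementation: it assumes each sampled path is already split into normalized step strings, that steps can be compared by exact match, and that a reference final answer is available to label paths as correct. The names `Path`, `mine_pivots`, `violates`, and `shortest_covering` are hypothetical, and the set-based check stands in for the fine-tuned verifier described in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Path:
    """One sampled reasoning path (hypothetical representation)."""
    steps: Tuple[str, ...]  # normalized reasoning steps, in order
    answer: str             # final answer produced by the path


def mine_pivots(paths: List[Path], reference: str) -> Set[str]:
    """Return the steps shared by every path that reaches the reference answer."""
    correct = [set(p.steps) for p in paths if p.answer == reference]
    if not correct:
        return set()
    return set.intersection(*correct)


def violates(path: Path, pivots: Set[str]) -> bool:
    """True if the path skips at least one mined pivot."""
    return not pivots.issubset(path.steps)


def shortest_covering(paths: List[Path], pivots: Set[str],
                      reference: str) -> Optional[Path]:
    """Pick the shortest correct path that still visits every pivot."""
    candidates = [
        p for p in paths
        if p.answer == reference and pivots.issubset(p.steps)
    ]
    return min(candidates, key=lambda p: len(p.steps), default=None)


# Toy example: solve x + 2 = 5.
paths = [
    Path(("x+2=5", "subtract 2 from both sides", "x=3"), "3"),
    Path(("restate the problem", "x+2=5", "x=3", "check: 3+2=5"), "3"),
    Path(("x+2=5", "x=7"), "7"),  # incorrect path
]

pivots = mine_pivots(paths, reference="3")
short_path = shortest_covering(paths, pivots, reference="3")
```

In this toy setting the two correct paths share the steps `x+2=5` and `x=3`, the incorrect path misses `x=3`, and the three-step path is selected as the short pivot-covering trace. A pair such as (`short_path`, incorrect path) is the kind of chosen/rejected example that a preference-optimization stage could consume.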
