Correct Reasoning Paths Visit Shared Decision Pivots

Dongkyu Cho, Amy B. Z. Zhang, Bilel Fehri, Sheng Wang, Rumi Chunara, Rui Song, Hengrui Cai

TL;DR

This work tackles the challenge of verifying and training chain-of-thought explanations by introducing decision pivots: minimal, verifiable checkpoints that correct reasoning paths must visit. It proposes ROMA, a three-stage self-training framework that (i) bootstraps diverse reasoning paths to mine shared pivots, (ii) compresses them into pivot-focused short-path reasoning (SPR) using a fine-tuned verifier, and (iii) post-trains the model with pairwise preference optimization (DPO) to emphasize concise, pivot-rich explanations. Across LogiQA, MedQA, and MATH500, pivot-centric SPR with a domain-tuned verifier yields consistent improvements in both task accuracy and reasoning quality, outperforming metric-based filtering and naïve self-training. The method reduces reasoning length and provides a scalable, domain-adaptable approach to aligning reasoning without requiring ground-truth rationales. Overall, the pivot-based framework offers a practical path to more faithful, verifiable reasoning in LLMs with broad applicability across domains.

Abstract

Chain-of-thought (CoT) reasoning exposes the intermediate thinking process of large language models (LLMs), yet verifying those traces at scale remains unsolved. In response, we introduce the idea of decision pivots: minimal, verifiable checkpoints that any correct reasoning path must visit. We hypothesize that correct reasoning paths, though stylistically diverse, converge on the same pivot set, while incorrect paths violate at least one pivot. Leveraging this property, we propose a self-training pipeline that (i) samples diverse reasoning paths and mines shared decision pivots, (ii) compresses each trace into pivot-focused short-path reasoning using an auxiliary verifier, and (iii) post-trains the model using its self-generated outputs. The proposed method aligns reasoning without ground-truth reasoning data or external metrics. Experiments on standard benchmarks such as LogiQA, MedQA, and MATH500 show the effectiveness of our method.
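The pivot hypothesis in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each sampled reasoning path has already been reduced to a set of normalized pivot strings (the abstract does not say how pivots are extracted), that answer labels are available to tell correct paths from incorrect ones, and it approximates "shared pivots" as a plain set intersection. The names `mine_shared_pivots` and `missing_pivots` are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (predicted answer, set of pivot strings in the path).
Path = Tuple[str, Set[str]]


def mine_shared_pivots(paths: List[Path], gold_answer: str) -> Set[str]:
    """Return the pivots visited by every path that reaches the gold answer.

    Assumption: shared pivots are modeled as a set intersection over
    correct paths; the paper's actual mining procedure may differ.
    """
    correct = [pivots for answer, pivots in paths if answer == gold_answer]
    if not correct:
        return set()
    return set.intersection(*correct)


def missing_pivots(path: Path, shared: Set[str]) -> Set[str]:
    """Return the shared pivots that a given path fails to visit."""
    _, pivots = path
    return shared - pivots


# Toy example: two stylistically different correct paths and one incorrect path.
paths = [
    ("B", {"all A are B", "x is A", "x is B"}),
    ("B", {"x is A", "restate question", "all A are B", "x is B"}),
    ("C", {"x is A", "some A are C"}),
]
shared = mine_shared_pivots(paths, gold_answer="B")
violated = missing_pivots(paths[2], shared)
```

In this toy case the redundant step "restate question" drops out of the shared set, and the incorrect path is flagged because it skips at least one shared pivot, which mirrors the hypothesis stated above.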

Paper Structure

This paper contains 40 sections, 5 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: A model's chain-of-thought reasoning may take various paths to reach a final decision, naturally visiting redundant or incorrect thought steps. However, such errors are not easily captured by existing methods. Instead, we introduce the concept of decision pivots: the set of key pieces of information that a model's reasoning path must visit to reach a given decision. We aim to generate a concise, factual reasoning path that focuses on the decision pivots, which we denote Short-Path Reasoning (SPR).
  • Figure 2: We present ROMA, a novel self-training framework that leverages the concept of decision pivots. The framework works in three stages. Given a question: (A) the model produces multiple $\textsc{prediction} + \textsc{reasoning}$ pairs, re-sampling as needed so that $K$ reasoning paths are collected. (B) A fine-tuned verifier synthesizes a short-path reasoning that focuses on the shared decision pivots, generating a preference data pair. (C) The generated data is used for reinforcement learning (i.e., preference learning with DPO). This process can be repeated as a self-training loop that improves the model's reasoning capabilities.
  • Figure 2: ROSCOE scores and the downstream LogiQA accuracy across model sizes.
  • Figure 3: Comparison of self-training results on LogiQA and MedQA. Our proposed method provides large self-improvement gains in domains where the generated reasoning is generally difficult to verify (e.g., MedQA for healthcare, LogiQA for general language), compared to math/coding domains (e.g., MATH500 for math). S-T (ROSCOE) refers to self-training with ROSCOE-filtered reasoning. The error bars indicate the standard error over 5 runs.
  • Figure 4: Effect of the fine-tuned verifier on downstream accuracy across post-training regimes. We compare our method with and without a verifier on LogiQA, MedQA, and MATH500. The verifier yields consistent gains and is most impactful in the expert MedQA domain, supporting our claim that a domain-tuned verifier synthesizes higher-quality, pivot-focused short-path reasoning. Improvements are present but smaller on MATH500 (where an external verification signal already exists) and modest on LogiQA, aligning with our hypothesis and analysis.
  • ...and 6 more figures
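
Stage (C) in the Figure 2 caption relies on preference learning with DPO. As a rough illustration, the standard DPO loss for a single preference pair can be written as below. This is a sketch under stated assumptions, not the paper's training code: it assumes the verifier's short-path reasoning is the preferred ("chosen") response and an original sampled path is the "rejected" one, it takes sequence log-probabilities as plain floats, and the value `beta=0.1` is a hypothetical default.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical value; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    Inputs are summed token log-probabilities under the policy and
    under a frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the chosen (short-path) response relative to the reference.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is the mechanism by which post-training would favor concise, pivot-focused reasoning.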