Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
Raphael Schumann, Stefan Riezler
TL;DR
The paper studies how the solvability of MCQA items affects chain-of-thought reasoning in large language models, showing that questions a model cannot effectively solve drive false-positive CoTs and hallucinations. It introduces a solvability measure $p^{\theta}_{\text{solvable}}(q_i)$, derived from a Beta posterior over sampled outputs, and identifies an intermediate sweet-spot regime of solvability where learning is most effective. The authors integrate solvability into two learning paradigms: MCQ-ORM (an outcome-based reward model) and MCQ-DrGRPO (reinforcement learning with a solvability-weighted group-relative advantage). Across math and multimodal datasets, both yield higher rates of process-correct CoTs, and MCQ-DrGRPO also improves answer accuracy. Solvability-informed training thereby reduces hallucinations and makes CoT reasoning more reliable, with practical benefits for test-time CoT selection and policy optimization.
Abstract
Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
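To make the idea of estimating per-question solvability from sampled outputs concrete, the following is a minimal sketch and not the authors' formulation. It assumes a uniform Beta(1, 1) prior over the model's per-sample accuracy on a question, and it assumes that "solvable" means this accuracy exceeds the chance level 1/K for K answer options. The function name `solvability` and its parameters are hypothetical.

```python
from math import comb


def solvability(num_correct: int, num_samples: int, num_options: int) -> float:
    """Posterior probability that per-sample accuracy exceeds chance.

    Assumptions (not taken from the paper): a uniform Beta(1, 1) prior, so
    the posterior after observing c correct answers in n samples is
    Beta(c + 1, n - c + 1), and a chance level of 1 / num_options.
    """
    if num_samples < 0 or not 0 <= num_correct <= num_samples:
        raise ValueError("need 0 <= num_correct <= num_samples")
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")

    chance = 1.0 / num_options
    total = num_samples + 1
    # For integer shape parameters the Beta survival function reduces to a
    # binomial sum: P(p > t) = sum_{j=0}^{c} C(n+1, j) t^j (1-t)^(n+1-j).
    return sum(
        comb(total, j) * chance**j * (1.0 - chance) ** (total - j)
        for j in range(num_correct + 1)
    )


if __name__ == "__main__":
    # Hypothetical example: 8 sampled CoTs on a 4-option question.
    for correct in range(9):
        print(correct, round(solvability(correct, 8, 4), 4))
```

Under these assumptions the estimate rises monotonically with the number of correct samples, and with no samples it falls back to the prior mass above chance, 1 - 1/K. A different prior or a different threshold would change the values but not the overall shape.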
