Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA

Raphael Schumann, Stefan Riezler

TL;DR

The paper studies how the solvability of multiple-choice QA (MCQA) items affects chain-of-thought (CoT) reasoning in large language models, showing that questions that are effectively unsolvable for a model drive false-positive CoTs and hallucinations. It introduces a solvability measure $p^{\theta}_{\text{solvable}}(q_i)$, derived from a Beta posterior over sampled outputs, and identifies an intermediate "sweet-spot" regime where learning is most effective. The authors integrate solvability into two learning paradigms: MCQ-ORM (an outcome-based reward model) and MCQ-DrGRPO (reinforcement learning with a solvability-weighted group-relative advantage). Both yield more process-correct CoTs, and the RL variant also improves answer accuracy, across math and multimodal tasks. Solvability-informed training reduces hallucinations and makes CoT reasoning more reliable, with practical benefits for test-time CoT selection and policy optimization.

Abstract

Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
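
To make the idea of estimating solvability from sampled outputs concrete, the following is a minimal sketch and not the paper's definition of $p^{\theta}_{\text{solvable}}(q_i)$, which is not reproduced in this summary. It assumes a uniform Beta(1, 1) prior over the model's per-question answer accuracy and reads "solvable" as "the accuracy exceeds the random-guessing rate of one over the number of answer options". The function name `solvability` and its parameters are hypothetical.

```python
from math import comb


def solvability(num_correct: int, num_samples: int, num_choices: int) -> float:
    """Posterior probability that answer accuracy exceeds chance level.

    Assumption (not taken from the paper): the accuracy p has a uniform
    Beta(1, 1) prior, so after observing num_correct answer-correct CoTs
    among num_samples the posterior is Beta(num_correct + 1,
    num_samples - num_correct + 1). We return P(p > 1 / num_choices).
    """
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie between 0 and num_samples")
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")

    chance = 1.0 / num_choices
    total = num_samples + 1
    # For integer shape parameters, the Beta survival function equals a
    # binomial CDF: P(p > x) = P(Binomial(total, x) <= num_correct).
    return sum(
        comb(total, j) * chance**j * (1.0 - chance) ** (total - j)
        for j in range(num_correct + 1)
    )


if __name__ == "__main__":
    # 32 sampled CoTs on a five-choice question, as in the AQuA setting.
    for k in (0, 4, 8, 16, 32):
        print(k, round(solvability(k, 32, 5), 4))
```

Under these assumptions the estimate rises with the number of answer-correct samples, and for a fixed count it is higher when the question has more answer options, because lucky guesses become less likely.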

Paper Structure

This paper contains 30 sections, 16 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Modeling Solvability: The probability that a question is solvable by a given model, as defined by Equation \ref{equ:solvable}. (Left) Varying number of answer options for the multiple-choice question. (Right) Varying number of sampled CoTs per question. At Least One Process-Correct CoT: Fraction of questions where at least one of the 32 generated CoTs is process-correct. Questions are from the AQuA dataset (five choices) and CoTs are sampled with Llama3 1B (left) and Llama3 8B (right).
  • Figure 2: Advantage values of a single CoT with positive reward. 32 CoTs are sampled for each question and the x-axis denotes the number of answer-correct (positive reward) CoTs in a group. MCQ-DrGRPO down-weights CoTs that are generated for unsolvable questions. The probability that a multiple-choice question is unsolvable for the model depends on the number of choices $|{\bm{c}}_i|$. The values on the y-axis are omitted to allow visual comparison across methods. During training the relative differences between groups are important.
  • Figure 3: We sample 32 CoTs for each question in the respective training set. Questions are then categorized into buckets based on the number of answer-correct CoTs. We randomly sample questions from each bucket and pair them with exactly one of their answer-correct CoTs. We finetune the base model on these 2k instances and report the increase in answer accuracy over the base model on a held out development set. Experiments are repeated five times with different random seeds. Learning potential (LP) predicts relative increase in answer accuracy based on bucket membership.
  • Figure 4: The average reward during RL training with DrGRPO and MCQ-DrGRPO. The math reasoning dataset AQuA is used to train Llama3 1B and geo/year-guessing datasets are used to train multimodal Aya 8B. First graph also shows ablations for different numbers of sampled CoTs per question. Each model is trained with three different random seeds.
  • Figure 5: Additional metrics recorded during the reinforcement learning experiments. Length of Correct CoT: Average number of tokens in an answer-correct CoT. Sequence Entropy: Summed token entropy for a CoT sequence, normalized by length. Answer-Pass@4: Percentage of questions with at least one answer-correct CoT among four samples.
  • ...and 9 more figures
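
The Figure 2 caption describes MCQ-DrGRPO as down-weighting CoTs generated for unsolvable questions. The sketch below illustrates one way such a weighting could look. It is an assumption, not the paper's objective: it takes a mean-centered group-relative advantage without standard-deviation normalization and multiplies every advantage in the group by a solvability weight in [0, 1] that is supplied from outside. The name `weighted_advantages` and the choice to scale negative as well as positive advantages are hypothetical.

```python
from typing import List


def weighted_advantages(rewards: List[float], solvability_weight: float) -> List[float]:
    """Group-relative advantages scaled by a per-question solvability weight.

    Assumptions (not taken from the paper): the advantage of each CoT is its
    reward minus the group mean, and the whole group is scaled by the same
    solvability weight in [0, 1].
    """
    if not rewards:
        raise ValueError("rewards must contain at least one sampled CoT")
    if not 0.0 <= solvability_weight <= 1.0:
        raise ValueError("solvability_weight must lie in [0, 1]")

    group_mean = sum(rewards) / len(rewards)
    return [solvability_weight * (r - group_mean) for r in rewards]


if __name__ == "__main__":
    # One answer-correct CoT among four samples for the same question.
    rewards = [1.0, 0.0, 0.0, 0.0]
    print(weighted_advantages(rewards, 1.0))  # unweighted baseline
    print(weighted_advantages(rewards, 0.5))  # question judged less solvable
```

With this form, a lone answer-correct CoT on a question judged likely unsolvable receives a smaller positive advantage than the same CoT on a question judged solvable, which matches the qualitative behavior the caption describes.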