RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

Tianqianjin Lin, Xi Zhao, Xingyao Zhang, Rujiao Long, Yi Xu, Zhuoren Jiang, Wenbo Su, Bo Zheng

TL;DR

The paper tackles the difficulty of sampling high-utility reasoning paths in reinforcement-learning–driven fine-tuning of large language models. It introduces RAVR, an end-to-end framework that conditions reasoning on reference answers, forms an amortized posterior over reasoning paths, and optimizes via a variational objective that couples posterior and prior with a KL penalty and a utility-based reward. The authors prove that answer conditioning amplifies the likelihood of high-utility reasoning paths and demonstrate substantial gains on both general and math reasoning benchmarks, along with analyses of reasoning behavior and learning dynamics. This approach reduces exploration risk and improves robustness, with practical impact for deploying more capable, reasoning-rich LLMs; code and data are made available to support reproducibility.

Abstract

Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but it depends critically on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning paths can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that "Why is this the answer?" is often an easier question than "What is the answer?", as it avoids the heavy cognitive load of open-ended exploration and instead calls for explanatory reconstruction: systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on the answer increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
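
The amplification claim in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's proof or code: it assumes a finite set of three hypothetical reasoning paths, takes the utility of a path to be the probability that it yields the reference answer, and uses the exact Bayesian posterior rather than a learned one. All numbers are invented for illustration.

```python
# Toy check that conditioning on the answer raises expected utility.
# Assumptions (not from the paper): three candidate reasoning paths,
# utility of a path = p(y | x, z), and an exact Bayesian posterior.

# prior[i]   = p(z_i | x): chance of sampling path z_i from the question alone
# utility[i] = p(y | x, z_i): chance that path z_i leads to the reference answer
prior = [0.70, 0.25, 0.05]      # made-up values; the best path is rarely sampled
utility = [0.01, 0.10, 0.90]    # made-up values


def expected(weights, values):
    """Expectation of `values` under the distribution `weights`."""
    return sum(w * v for w, v in zip(weights, values))


# Marginal likelihood of the answer: p(y | x) = sum_z p(z | x) * p(y | x, z)
evidence = expected(prior, utility)

# Bayes' rule: p(z | x, y) = p(z | x) * p(y | x, z) / p(y | x)
posterior = [p * u / evidence for p, u in zip(prior, utility)]

prior_utility = expected(prior, utility)
posterior_utility = expected(posterior, utility)

print(f"expected utility, question only:      {prior_utility:.3f}")
print(f"expected utility, answer-conditioned: {posterior_utility:.3f}")
```

Under these assumptions the posterior expectation equals E[u^2] / E[u] taken under the prior, which is never smaller than E[u], so the answer-conditioned sampler shifts probability mass toward the high-utility path.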

Paper Structure

This paper contains 20 sections, 27 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: We sampled 50 hard questions from CrossThink-QA [akter2025nemotron] on which Qwen3-1.7B, with thinking mode enabled, failed to obtain the correct answer in any of 8 attempts. For each question, we provided the answer in the prompt and asked the LLM to derive the reasoning. A GPT-5 judge, using strict criteria, evaluated whether the derived reasoning was logically coherent and did not reveal access to the answer. For over 50% of these unlearnable questions, the LLM was able to produce correct reasoning. See Appendix \ref{appx:motivation_case} for detailed prompts and cases.
  • Figure 2: The framework of RAVR. According to Section \ref{sec:motivation}, seeing the reference answer can amplify the sampling probability of good reasoning paths. Hence, we use this answer-conditioned posterior to help the learning of the question-only prior. The LLM is instructed to derive a reasoning path from the question to the answer with thinking mode enabled. RAVR regards the reference-answer probability as the reward for the generated reasoning and maximizes it to enhance the LLM's ability to think about "why is this the answer". Meanwhile, RAVR minimizes the KL divergence between the posterior and the question-only prior to help the model better think about "what is the answer"; in turn, the prior also regularizes the behavior of the posterior. See Section \ref{sec:vi-objective-discrete} for details.
  • Figure 3: Comparison of reasoning behaviors within thinking tags. Words on the x-axis are those frequently used during thinking, and the y-axis represents their average frequency per response. See Appendix \ref{appx:reasoning_behavior} for more results.
  • Figure 4: Comparison with GRPO across different rollout group sizes. When using a rollout group size of 8, RAVR attains or exceeds the performance of GRPO with a rollout group size of 24. This observation suggests that our approach markedly enhances the sampling efficiency of high-quality reasoning paths, thereby improving learning stability and efficiency.
  • Figure 5: Learning dynamics. Training on CrossThink-QA, testing on GPQA and MMLU-Pro.
  • ...and 6 more figures
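
The two terms named in the Figure 2 caption, a reference-answer reward and a KL divergence between the answer-conditioned posterior and the question-only prior, match the shape of a standard evidence lower bound (ELBO). The following is a minimal sketch of that generic bound over a finite set of hypothetical reasoning paths. It is not the paper's training objective, which may weight the terms differently and works with token-level estimates from an LLM. The function name `elbo` and all numbers are illustrative assumptions.

```python
import math

# Generic ELBO over a finite set of reasoning paths z:
#   ELBO(q) = E_q[log p(y | x, z)] - KL(q(z | x, y) || p(z | x)) <= log p(y | x)
# Assumption: this is the textbook bound, not the paper's exact objective.


def elbo(q, prior, likelihood):
    """Reward term minus KL penalty for a candidate posterior q."""
    reward = sum(qi * math.log(li) for qi, li in zip(q, likelihood) if qi > 0)
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    return reward - kl


# Made-up values for three hypothetical paths.
prior = [0.70, 0.25, 0.05]        # p(z | x), the question-only sampler
likelihood = [0.01, 0.10, 0.90]   # p(y | x, z), the reference-answer probability

evidence = sum(p * l for p, l in zip(prior, likelihood))
log_evidence = math.log(evidence)

# Candidate posteriors: the prior itself, and the exact Bayesian posterior.
q_prior = prior
q_exact = [p * l / evidence for p, l in zip(prior, likelihood)]

print(f"log p(y | x)           = {log_evidence:.4f}")
print(f"ELBO with q = prior    = {elbo(q_prior, prior, likelihood):.4f}")
print(f"ELBO with q = exact    = {elbo(q_exact, prior, likelihood):.4f}")
```

In this toy setting the bound is tight only for the exact posterior. Using the prior as the posterior gives zero KL penalty but a low reward, which is one way to read the caption's point that the reward and the KL term pull the two distributions toward each other.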