Variational Reasoning for Language Models

Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang

TL;DR

The paper addresses the instability and inefficiency of existing reasoning training approaches for large language models by proposing variational reasoning, which treats thinking traces as latent variables and optimizes a tractable objective via variational inference. It introduces an ELBO-based framework, extends it with IWAE-style multi-trace bounds, and stabilizes posterior learning with a forward KL divergence, linking these ideas to rejection sampling finetuning and binary-reward RL. The method yields significant improvements across math, coding, and general reasoning benchmarks on the Qwen 2.5 and Qwen 3 model families, is robust to prompt templates, and shows some generalization to out-of-distribution tasks. The work offers a principled probabilistic perspective that unifies variational inference (VI) with RL-style methods and provides stable, scalable objectives for enhancing the reasoning capabilities of language models, with code available for replication.

Abstract

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
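
As a rough illustration of the two bounds mentioned above, the block below writes out the textbook ELBO and the IWAE-style multi-trace bound with the thinking trace ${\bm{z}}$ as the latent variable. The notation follows the figure captions on this page ($\pi_\theta$ for the reasoning model, $q_\phi$ for the variational posterior, $\mathcal{Y}_{\bm{x}}$ for the set of correct answers to question ${\bm{x}}$). Reading ${\bm{y}}'$ as an answer hint that conditions the posterior is an assumption, and the paper's exact objectives may differ in detail from these standard forms.

```latex
% Marginal likelihood of a correct answer, with the thinking trace z marginalized out
\log \pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}})
  = \log \mathbb{E}_{\pi_\theta({\bm{z}} \mid {\bm{x}})}
      \left[ \pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}}, {\bm{z}}) \right]

% Single-trace ELBO (standard form)
\log \pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}})
  \geq \mathbb{E}_{q_\phi({\bm{z}} \mid {\bm{x}}, {\bm{y}}')}
      \left[ \log \pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}}, {\bm{z}}) \right]
  - \mathrm{KL}\left( q_\phi({\bm{z}} \mid {\bm{x}}, {\bm{y}}')
      \,\|\, \pi_\theta({\bm{z}} \mid {\bm{x}}) \right)

% IWAE-style multi-trace bound with K traces z_1, ..., z_K drawn i.i.d. from q_\phi
\log \pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}})
  \geq \mathbb{E}_{{\bm{z}}_{1:K} \sim q_\phi}
      \left[ \log \frac{1}{K} \sum_{k=1}^{K}
        \frac{\pi_\theta({\bm{z}}_k \mid {\bm{x}})\,
              \pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}}, {\bm{z}}_k)}
             {q_\phi({\bm{z}}_k \mid {\bm{x}}, {\bm{y}}')} \right]
```

For $K=1$ the multi-trace bound reduces to the ELBO, and it is non-decreasing in $K$, which is the sense in which the multi-trace objective is tighter.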

Paper Structure

This paper contains 36 sections, 1 theorem, 36 equations, 4 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

(Proof in the appendix.) For $|\mathcal{Y}_{\bm{x}}|>1$, the worst-case variances of the likelihood-based estimator and the accuracy-based estimator over all possible $\pi_{\theta}$ (under fixed $\pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}}, {\bm{z}})$) are … Therefore, the accuracy-based estimator has lower worst-case variance, i.e., $\max_{\pi_{\theta}}\textrm{Var}_{\textrm{acc}}\leq \max \ldots$

Figures (4)

  • Figure 1: Training loss and gradient norm of different methods during Qwen3-Base model training.
  • Figure 2: Pass@K comparison of baselines versus our method based on Qwen3-4B/8B-Base.
  • Figure 3: Effects of scaling up the number of thinking traces ($K$ in Algorithm 1) sampled from variational posterior $q_\phi$ on the performance of the final reasoning model $\pi_\theta$.
  • Figure 4: Density maps of the thinking token length versus the log-likelihood ratio $\log \frac{\pi_{\theta}({\bm{z}}_{k} \mid {\bm{x}})}{q_{\phi}({\bm{z}}_{k} \mid {\bm{x}},{\bm{y}}')}$ (left), and the answer token length versus the log-likelihood of the answer $\log \pi_\theta(\mathcal{Y}_{\bm{x}} \mid {\bm{x}}, {\bm{z}}_{k})$ (right).
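
The two quantities plotted in Figure 4 are the ingredients of a standard self-normalized importance weight over $K$ sampled traces. The sketch below shows how such weights and the corresponding multi-trace bound estimate could be computed from per-trace log values. It is a minimal illustration under the usual IWAE assumptions, not the paper's implementation; the function names `normalized_trace_weights` and `multi_trace_bound` are hypothetical.

```python
import math


def _log_weights(log_ratio, log_answer_lik):
    """Per-trace log importance weights.

    log_ratio[k]      = log pi_theta(z_k | x) - log q_phi(z_k | x, y')
    log_answer_lik[k] = log pi_theta(Y_x | x, z_k)
    """
    if len(log_ratio) != len(log_answer_lik) or len(log_ratio) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return [r + a for r, a in zip(log_ratio, log_answer_lik)]


def normalized_trace_weights(log_ratio, log_answer_lik):
    """Self-normalized importance weights over K traces (they sum to 1)."""
    log_w = _log_weights(log_ratio, log_answer_lik)
    shift = max(log_w)  # subtract the max before exponentiating to avoid underflow
    unnormalized = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def multi_trace_bound(log_ratio, log_answer_lik):
    """Monte Carlo estimate of the K-trace bound: log of the mean importance weight."""
    log_w = _log_weights(log_ratio, log_answer_lik)
    shift = max(log_w)
    mean_shifted = sum(math.exp(lw - shift) for lw in log_w) / len(log_w)
    return shift + math.log(mean_shifted)


if __name__ == "__main__":
    # Made-up log values for three traces, for illustration only.
    ratios = [-2.0, -0.5, -1.0]
    answer_liks = [math.log(0.9), math.log(0.2), math.log(0.6)]
    print(normalized_trace_weights(ratios, answer_liks))
    print(multi_trace_bound(ratios, answer_liks))
```

Working in log space with the maximum subtracted keeps the computation stable when sequence-level log-likelihood ratios are large in magnitude, which is common for long thinking traces.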

Theorems & Definitions (1)

  • Theorem 1