Training Chain-of-Thought via Latent-Variable Inference

Du Phan; Matthew D. Hoffman; David Dohan; Sholto Douglas; Tuan Anh Le; Aaron Parisi; Pavel Sountsov; Charles Sutton; Sharad Vikram; Rif A. Saurous

TL;DR

TRICE reframes chain-of-thought (CoT) prompting as latent-variable inference and introduces an MCMC-EM fine-tuning method that marginalizes over rationales. It maintains a memory of candidate rationales, refreshes that memory with an independence-sampler Metropolis-Hastings update, and maximizes the marginal likelihood of correct answers using gradient estimates whose variance is reduced by a control variate and gradient subsampling. Empirically, TRICE improves accuracy on GSM8K and BIG-Bench Hard (BBH) relative to STaR, direct CoT tuning, and rejection sampling, while requiring fewer hand-annotated rationales. The approach adds a tool for training interpretable step-by-step reasoning in LLMs and could extend to nondeterministic outputs and tool-use scenarios.
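As a rough illustration of the memory update described above, here is a minimal Python sketch on a toy addition task. Everything in it is hypothetical: `propose` stands in for sampling a rationale from the model, `is_correct` stands in for checking the rationale's final answer, and the memory is seeded with correct rationales purely for convenience. It assumes the answer is a deterministic function of the rationale, in which case a Metropolis-Hastings step that proposes from the model itself reduces to "replace the stored rationale if the proposal reaches the correct answer, otherwise keep it". It is not the authors' implementation.

```python
import random

def propose(question, rng):
    """Hypothetical stand-in for sampling a rationale from the model."""
    a, b = question
    guess = a + b + rng.choice([-1, 0, 0, 1])  # sometimes off by one
    return f"{a} plus {b} makes {guess}. Answer: {guess}"

def is_correct(rationale, answer):
    """Hypothetical check: does the rationale end in the right answer?"""
    return rationale.rsplit("Answer: ", 1)[-1] == str(answer)

def mh_memory_sweep(memory, questions, answers, rng):
    """One independence-sampler sweep over the rationale memory.

    Assumes a deterministic answer check, so a proposal is accepted
    exactly when it yields the correct answer.
    """
    new_memory, accepted = list(memory), []
    for i, (x, y) in enumerate(zip(questions, answers)):
        proposal = propose(x, rng)
        ok = is_correct(proposal, y)
        if ok:
            new_memory[i] = proposal  # accept: overwrite stored rationale
        accepted.append(ok)           # reject: stored rationale is kept
    return new_memory, accepted

rng = random.Random(0)
questions = [(2, 3), (10, 7), (4, 4)]
answers = [5, 17, 8]
# Seed the memory with correct rationales (an assumption of this toy).
memory = [f"{a} plus {b} makes {a + b}. Answer: {a + b}" for a, b in questions]
for _ in range(5):
    memory, accepted = mh_memory_sweep(memory, questions, answers, rng)
```

Because rejected proposals leave the stored rationale untouched, a memory entry that is correct stays correct across sweeps; in a full training loop the stored rationales would then serve as targets for a gradient step.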

Abstract

Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
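The link between the marginal log-likelihood and posterior sampling can be made explicit with a short derivation sketch. The roles of the symbols are assumptions made here for illustration: $x$ is the question, $y$ the correct answer, $z$ the latent rationale, and the answer likelihood $p(y\mid z, x)$ is taken not to depend on $\theta$.

$$
\log p_\theta(y\mid x) = \log \sum_{z} p_\theta(z\mid x)\, p(y\mid z, x)
$$

$$
\nabla_\theta \log p_\theta(y\mid x)
= \frac{\sum_{z} p(y\mid z, x)\, \nabla_\theta p_\theta(z\mid x)}{p_\theta(y\mid x)}
= \sum_{z} p_\theta(z\mid x, y)\, \nabla_\theta \log p_\theta(z\mid x)
= \mathbb{E}_{p_\theta(z\mid x, y)}\!\left[\nabla_\theta \log p_\theta(z\mid x)\right]
$$

The second equality uses $\nabla_\theta p_\theta(z\mid x) = p_\theta(z\mid x)\,\nabla_\theta \log p_\theta(z\mid x)$ together with Bayes' rule, $p_\theta(z\mid x, y) = p_\theta(z\mid x)\, p(y\mid z, x) / p_\theta(y\mid x)$. Under these assumptions, an unbiased gradient of the marginal log-likelihood needs rationales drawn from the posterior $p_\theta(z\mid x, y)$, which is the sampling problem the abstract identifies as the core challenge.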

Paper Structure (30 sections, 23 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Method
Derivation
The true gradient.
Independence sampler for $p_\theta(z\mid x, y)$.
Basic gradient estimator.
Adding a control variate.
Estimating $\beta$.
Gradient subsampling.
Why not variational inference, reweighted wake-sleep, or rejection sampling?
Related Work
Self-Taught Reasoner
Experiments
Discussion
Limitations:
...and 15 more sections

Figures (7)

Figure 1: Example of rationale lengths shrinking during reweighted wake-sleep (RWS) training. The blue line shows the average number of tokens per rationale generated by the guide; the orange line shows the average number of tokens per rationale, weighted by each rationale's importance weight.
Figure 2: Time-varying estimates (with loess smoothers) of average training-set accuracy $p(y\mid x)$ and greedy-decoding validation-set accuracy on GSM8K for TRICE with and without the subsampled control-variate gradient estimator ("TRICE CV" and "TRICE no CV", respectively) and for four-particle rejection sampling ("RS").
Figure 3: Examples of rationales where TRICE gets the answer right, right but for the wrong reasons, and wrong.
Figure 4: Examples of rationales where TRICE gets the answer right, right but for the wrong reasons, and wrong.
Figure 5: Example where the prompt-tuned RWS guide model pastes in the correct answer at the end, contradicting the rationale up to that point. The rationales generated by the guide and model are almost identical up to the final answer block.
...and 2 more figures
