Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang

TL;DR

LaTRO reframes reasoning in large language models as sampling from a latent distribution and trains the model with a variational objective that uses its own reasoning as a self-generated reward. By adopting a KL-regularized latent reasoner and a REINFORCE Leave-One-Out gradient estimator, LaTRO simultaneously improves reasoning generation and the evaluation of reasoning quality without external feedback. Empirically, LaTRO yields substantial gains on GSM8K across multiple architectures and competitive gains on ARC-Challenge, while enabling a shift of some computation from inference to training. The work suggests pretrained LLMs harbor latent, activatable reasoning capabilities that can be unlocked through self-improvement dynamics during training, with implications for scalable reasoning in AI systems.
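The leave-one-out baseline mentioned above can be illustrated with a minimal sketch. It assumes the self-generated reward for each sampled rationale is the model's log-likelihood of the correct answer given that rationale, minus a weighted KL penalty. The function names, the `beta` weight, and the numbers are illustrative assumptions, not taken from the LaTRO code.

```python
from typing import List


def self_rewards(answer_logprobs: List[float], kl_terms: List[float], beta: float) -> List[float]:
    """Hypothetical self-reward per rationale: log-likelihood of the correct
    answer given the rationale, minus a beta-weighted KL penalty."""
    return [lp - beta * kl for lp, kl in zip(answer_logprobs, kl_terms)]


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages: each reward minus the mean of the other rewards."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per query")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Made-up values for four rationales sampled for one query.
    rewards = self_rewards(
        answer_logprobs=[-0.5, -2.0, -1.0, -4.0],
        kl_terms=[0.1, 0.2, 0.1, 0.3],
        beta=0.05,
    )
    for reward, advantage in zip(rewards, rloo_advantages(rewards)):
        print(f"reward={reward:.3f}  advantage={advantage:.3f}")
```

Because each baseline is computed only from the other samples, the advantages for a query sum to zero, and rationales that make the correct answer more likely than their siblings receive a positive weight in the policy-gradient update.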

Abstract

Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code for LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.
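As a rough illustration of what "optimizes it via variational approaches" can mean, the following is the generic latent-variable lower bound obtained from Jensen's inequality. The symbols $q$ (a proposal distribution over rationales) and $\pi_0$ (a reference distribution over rationales) are assumed notation for this sketch, so the paper's exact objective may differ in form.

$$
\log \mathbb{E}_{\pmb{z}\sim\pi_0(\cdot\mid\pmb{x})}\big[\pi_\theta(\pmb{y}\mid\pmb{x},\pmb{z})\big]
\;\ge\;
\mathbb{E}_{\pmb{z}\sim q(\cdot\mid\pmb{x})}\big[\log \pi_\theta(\pmb{y}\mid\pmb{x},\pmb{z})\big]
\;-\;
D_{\mathrm{KL}}\big[q(\cdot\mid\pmb{x})\,\big\|\,\pi_0(\cdot\mid\pmb{x})\big]
$$

The first term on the right rewards rationales that make the correct answer likely, and the KL term keeps the rationale sampler close to the reference distribution.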

Paper Structure

This paper contains 25 sections, 2 theorems, 10 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Denote the user query, model response, and reasoning rationale by $\pmb{x}, \pmb{y},\pmb{z}$, respectively. The distribution of the majority vote answer of the $K$ reasoning rationales obtained by CoT-SC approximates $p_{M}(\pmb{y} | \pmb{x}):= \mathbb{E}_{\pmb{z}\sim\mathop{\mathrm{\pi_\theta}}\nol
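The majority-vote step of CoT-SC referred to in the proposition can be sketched as follows. This is a minimal illustration that assumes the final answer has already been extracted from each of the $K$ sampled rationales as a string. The function name and the sample answers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, Dict[str, float]]:
    """Return the most frequent answer and the empirical answer frequencies.

    `answers` holds one extracted final answer per sampled rationale.
    Ties are broken in favor of the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    k = len(answers)
    freqs = {answer: count / k for answer, count in counts.items()}
    winner = counts.most_common(1)[0][0]
    return winner, freqs


if __name__ == "__main__":
    # Made-up answers extracted from K = 5 sampled rationales.
    winner, freqs = majority_vote(["18", "18", "20", "18", "17"])
    print(winner, freqs)
```

The empirical frequencies act as a Monte Carlo estimate over sampled rationales, which is the sense in which voting over more rationales gives a better approximation of the answer distribution.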

Figures (9)

  • Figure 1: Overview of LaTRO with an example question from GSM8K (Cobbe et al., 2021). LaTRO treats reasoning trajectories as latent variables and optimizes the underlying distribution through self-rewarding. Given a question, the language model generates multiple reasoning rationales, evaluates their likelihood of producing the correct answer, and updates its parameters to favor high-quality rationales. This iterative process allows the model to improve both its ability to generate good reasoning paths and to evaluate the quality of those paths.
  • Figure 2: Average negative log probabilities of LLMs to generate correct responses.
  • Figure 3: Ablation study results on GSM8K with base model Phi-3.5. In (a), the $x$-axis is the maximum token length $L$ of the reasoning rationales, the $y$-axis is the accuracy, and the plot shows zero-shot performance vs. maximum token length for different methods. In (b), the $x$-axis is the number of sampled reasoning rationales, the $y$-axis is the accuracy, and the plot shows zero-shot performance vs. the number of reasoning rationales used in the majority vote.
  • Figure 4: Sample responses to a GSM8K question from Mistral-7B models; errors are highlighted in red.
  • Figure 5: CoT template for GSM8K
  • ...and 4 more figures

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof