Reasoning to Learn from Latent Thoughts

Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto

TL;DR

The paper tackles data efficiency in large language model pretraining by introducing a reasoning-to-learn paradigm that augments text with latent, human-like thoughts expressed in natural language. It formalizes Latent Thought Models (LTMs) as a latent-variable framework in which Z represents the latent thoughts behind observed text X, and trains a model of both the joint p(Z, X) and the approximate posterior q(Z|X). Training uses an EM-based Bootstrapping Latent Thoughts (BoLT) algorithm that iteratively improves latent quality through Monte Carlo sampling. Empirical results on a reasoning-heavy corpus show that synthetic latent thoughts significantly improve data efficiency and downstream math performance, with BoLT enabling self-improvement across iterations and continued bootstrapping in continual learning setups. The study also discusses broader implications, limitations, and future directions, including applying latent thoughts to general-domain data and hierarchical latent structures, as a path toward more data-efficient pretraining at scale.
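
The latent-variable view in this summary can be made concrete with the standard evidence lower bound. The following is a generic sketch of that bound, not a transcription of the paper's own equations; the parameter symbol $\theta$ is notation assumed here.

```latex
\begin{align}
\log p_\theta(X) &= \log \sum_{Z} p_\theta(Z, X) \\
&\ge \mathbb{E}_{Z \sim q(Z \mid X)}
   \big[ \log p_\theta(Z, X) - \log q(Z \mid X) \big].
\end{align}
```

Under this reading, an E-step improves the latent thoughts drawn from $q(Z \mid X)$, and an M-step raises $\log p_\theta(Z, X)$ on the resulting $(Z, X)$ pairs with ordinary next-token prediction.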

Abstract

Compute scaling for language model (LM) pretraining has outpaced the growth of human-written texts, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process, and assumes that the latent thoughts contain important contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency over training on the same amount of raw data. Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.
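
The E-step mentioned in the abstract spends extra inference compute to obtain better latent thoughts. One common way to do this, sketched here as an assumption rather than as the paper's exact procedure, is to draw K candidates from q(Z|X), weight each by p(Z, X)/q(Z|X), and resample one of them. The function name `resample_latent` and its inputs are hypothetical; in practice the log-probabilities would come from scoring each candidate with the LM.

```python
import math
import random


def resample_latent(candidates, log_joint, log_proposal, rng):
    """Pick one latent from K candidates using self-normalized importance weights.

    candidates   : list of K latent thoughts sampled from q(Z | X)
    log_joint    : list of K values log p(Z_k, X)
    log_proposal : list of K values log q(Z_k | X)
    rng          : a random.Random instance, passed in for reproducibility
    """
    # Unnormalized log importance weights: log p(Z_k, X) - log q(Z_k | X).
    log_w = [lj - lq for lj, lq in zip(log_joint, log_proposal)]

    # Subtract the max before exponentiating for numerical stability.
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Resample a single candidate in proportion to its normalized weight.
    idx = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    return candidates[idx], probs
```

Increasing K gives the resampling step more candidates to choose from, which is one plausible reading of how additional inference compute could improve the E-step.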

Paper Structure

This paper contains 87 sections, 8 equations, 22 figures, 1 table, 1 algorithm.

Figures (22)

  • Figure 1: Reasoning to learn. (Left) Motivated by how humans apply deliberate thinking to learn from limited data, we train an LM to infer (or "decompress") latent thoughts underlying the highly compressed observed data. These synthesized latent thoughts augment the raw observed data during pretraining, improving the LM's data efficiency. This procedure can be iteratively applied through an EM algorithm (Figure 5) to form a model self-improvement loop where increasingly capable LMs synthesize more effective latent thoughts, which in turn train more capable models. (Right) Our results demonstrate consistent improvement in model performance across bootstrap iterations.
  • Figure 2: Reasoning to learn with latent thought models. (a) The latent thought model is trained to "decompress" plausible human thoughts underlying the observed data (i.e., $q(Z \mid X)$) and to utilize the latent thoughts in learning more efficiently from the data (i.e., $p(Z, X)$), resembling a deliberate human thought process. (b) The latent thought is modeled for each chunk of text in an autoregressive manner and in the same discrete text space. Given paired data $\left\{(Z_n, X_n)\right\}_{n=1}^N$, we use standard next-token prediction to train a single LM as both $p(Z, X)$ and $q(Z \mid X)$, by randomly placing $Z_n$ before or after $X_n$ in the sequence. This strategy allows for minimal modifications to the standard LM pretraining pipeline.
  • Figure 3: We can use GPT-4o-mini to synthesize latent thoughts to train the initial latent thought model. The synthetic latent thoughts as shown in (b) typically contain the background knowledge and reasoning not explicitly stated in the raw data, presented in a consistent and clean form. The prompt and example are simplified for clarity; see the paper for the full prompt and additional examples.
  • Figure 4: Downstream transfer of latent thoughts. Since models have been trained on data augmented with latent thoughts at scale, they can be few-shot prompted to perform CoT reasoning in the latent space $Z$ on downstream tasks (e.g., MATH). The prompt and additional examples are provided in the paper.
  • Figure 5: Bootstrapping latent thoughts (BoLT) in an iterative Expectation-Maximization algorithm. In the E-step, we use Monte Carlo sampling as a "policy improvement operator" to obtain higher-quality latent thoughts. This boosts learning efficiency in the M-step, enabling the training of more capable LMs that synthesize better latent thoughts.
  • ...and 17 more figures
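
The training recipe in the Figure 2 caption (one LM serving as both $p(Z, X)$ and $q(Z \mid X)$ by randomly placing each latent before or after its text chunk) can be sketched as a data-formatting step. This is a minimal illustration under stated assumptions: the delimiter strings, the function names, and the 50/50 placement probability are hypothetical, since the captions do not specify them.

```python
import random

# Hypothetical delimiters; the actual special tokens are not given in the source.
PRIOR_OPEN, PRIOR_CLOSE = "<prior>", "</prior>"
POST_OPEN, POST_CLOSE = "<posterior>", "</posterior>"


def format_example(thought, text, rng, p_before=0.5):
    """Return one training string with the latent placed before or after the text."""
    if rng.random() < p_before:
        # Z before X: next-token prediction learns the joint p(Z, X).
        return f"{PRIOR_OPEN}{thought}{PRIOR_CLOSE}{text}"
    # Z after X: next-token prediction learns to infer Z from X, i.e. q(Z | X).
    return f"{text}{POST_OPEN}{thought}{POST_CLOSE}"


def build_sequence(pairs, rng, p_before=0.5):
    """Concatenate formatted (thought, text) chunk pairs into one training sequence."""
    return "".join(format_example(z, x, rng, p_before) for z, x in pairs)
```

Because the output is an ordinary text sequence, it can be fed to a standard next-token-prediction pipeline without architectural changes, which matches the caption's point about minimal modifications.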