Latent Reasoning with Supervised Thinking States

Ido Amos; Avi Caciularu; Mor Geva; Amir Globerson; Jonathan Herzig; Lior Shani; Idan Szpektor

TL;DR

This work introduces Thinking States, a recurrent reasoning framework that generates natural-language thoughts while processing input chunks, compresses them into fixed-size states, and injects these states into subsequent token representations without expanding the context window. The Thinking Block T and Compression Block C form a chunk-recurrent architecture trained with teacher-forced supervision, which allows fully parallel training and avoids backpropagation through time. Empirically, Thinking States improves over latent reasoning baselines, matches or comes close to CoT on multi-hop QA, and delivers significant speedups on state-tracking and GSM8K-style tasks while maintaining strong length generalization. The approach also provides interpretability through the recovered thinking traces and identifies failure modes such as state ambiguity. It suggests prompt or decoding refinements as possible remedies, and RL-based fine-tuning as a future direction.

Abstract

Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs due to the generation of long rationales. We propose Thinking States, a method that performs reasoning while the input is being processed. Specifically, Thinking States generates a sequence of thinking tokens every few input tokens, transforms the thoughts back into embedding space, and adds them to the following input tokens. This has two key advantages. First, it captures the recurrent nature of CoT, except that the thought tokens are generated as the input is being processed. Second, since the thoughts are represented as tokens, they can be learned from natural language supervision using teacher forcing, which is parallelizable. Empirically, Thinking States outperforms other latent reasoning methods on multiple reasoning tasks, narrowing the gap to CoT on math problems and matching its performance on 2-Hop QA with improved latency. On state-tracking tasks, we show that Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than those seen during training.
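The inference loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `think` and `compress` are hypothetical stand-ins for the thinking and compression blocks, the initial state is assumed to be all zeros, and the state is added directly to the input embeddings, whereas the paper injects it at a shallow layer of the model.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def chunked(items: List[Vector], size: int) -> List[List[Vector]]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_thinking_states(
    token_embeddings: List[Vector],
    chunk_size: int,
    think: Callable[[List[Vector]], List[int]],    # hypothetical thinking block
    compress: Callable[[List[int], int], Vector],  # hypothetical compression block
) -> Tuple[List[Vector], List[List[int]]]:
    dim = len(token_embeddings[0])
    state = [0.0] * dim  # assumption: the first chunk sees an all-zero state
    processed: List[Vector] = []
    thoughts: List[List[int]] = []
    for chunk in chunked(token_embeddings, chunk_size):
        # Add the current fixed-size state to every token in the chunk.
        injected = [[x + s for x, s in zip(vec, state)] for vec in chunk]
        processed.extend(injected)
        # Generate thought tokens for this chunk, then compress them
        # into the state that the next chunk will receive.
        thought = think(injected)
        thoughts.append(thought)
        state = compress(thought, dim)
    return processed, thoughts


def toy_think(chunk: List[Vector]) -> List[int]:
    """Toy stand-in: the 'thought' is just the chunk length."""
    return [len(chunk)]


def toy_compress(thought: List[int], dim: int) -> Vector:
    """Toy stand-in: broadcast the sum of thought tokens to a fixed size."""
    return [float(sum(thought))] * dim


embeddings = [[1.0, 1.0] for _ in range(5)]
processed, thoughts = run_thinking_states(embeddings, 2, toy_think, toy_compress)
```

In this sketch the number of processed token vectors equals the number of input tokens, which mirrors the point that thoughts influence later tokens through a fixed-size state instead of being appended to the context.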

Paper Structure (31 sections, 6 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Latent Reasoning.
Emergent Latent Reasoning in Standard LLMs.
Alternative Approaches to Latent Computation.
Thinking State Architecture and Supervision
Architecture
Training with Teacher-Forced Reasoning
Fast Prefill with Speculative Thinking
Constructing Chunk-Level Supervision
Step-to-Token Alignment.
Token-to-Chunk Alignment.
Experiments
Experimental Setup
Metrics.
...and 16 more sections

Figures (7)

Figure 1: Reasoning with a Thinking State compared to Chain-of-Thought: An LLM is trained to generate task-relevant thinking sequences while processing tokens. Each generated thought sequence is transformed into a fixed-size state, denoted by $S$, added to the following token representations.
Figure 2: Thinking States model at inference and training time. At inference time (a), Thinking States reasons by iteratively processing token chunks. At each iteration $i$, reasoning information encoded up to layer $L^{out}$ is decoded by a thinking block $T$ and compressed by a compression block $C$ to a fixed-size thinking state $\mathbf{S_{i+1}}$. The state is injected into the next chunk at a shallow layer to effectively influence future computations. Chunk representations can access the history via attention layers and the KV-cache. At training time (b), predictions by $T$ are supervised by explicit annotations paired with each chunk. The same annotations are used to condition the base LLM's reasoning, allowing parallel training.
Figure 3: Ablation studies over deep-to-shallow recurrence (a) and effects of the chunk size (b), with measured speedup over CoT (top) at each point. In (a), performance increases with the number of layers used to process and generate states, pointing to the importance of the deep-to-shallow design. In (b), increasing the chunk size improves latency, with peak performance for medium chunks, highlighting the tradeoff between per-state capacity (large chunks) and reasoning frequency (small chunks).
Figure 4: Examples where Thinking States succeeds but CoT fails. Thinking States reasoning is shown in green. (1) CoT hallucinates an extra step. (2) CoT attempts multiple operations in one step and errs.
Figure 5: Illustration of state ambiguity. In (a), Thinking States generates valid thoughts that are not aligned with the final clause in the query, leading to an error. When the information is moved to the beginning of the query, as in (b), the same model produces a correct answer without training on the new format.
...and 2 more figures
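The parallel training setup in Figure 2(b) can be illustrated with a small sketch. It assumes, as a simplification, that each chunk's state is computed only from the gold annotation of the previous chunk, so no state depends on the model's own predictions and all chunks can be prepared in one pass. The names `teacher_forced_inputs` and `toy_compress` are hypothetical, and the state is added to the input embeddings here, not at a shallow layer as in the paper.

```python
from typing import Callable, List

Vector = List[float]


def toy_compress(thought: List[int], dim: int) -> Vector:
    """Toy stand-in for the compression block: broadcast the token sum."""
    return [float(sum(thought))] * dim


def teacher_forced_inputs(
    token_embeddings: List[Vector],
    chunk_size: int,
    gold_thoughts: List[List[int]],  # one annotated thought per chunk
    compress: Callable[[List[int], int], Vector],
) -> List[Vector]:
    dim = len(token_embeddings[0])
    # Chunk 0 gets a zero state (assumption); chunk i+1 gets the state
    # compressed from the gold annotation of chunk i. Nothing here depends
    # on model output, so every state can be computed independently.
    states = [[0.0] * dim] + [compress(t, dim) for t in gold_thoughts[:-1]]
    injected = []
    for index, vec in enumerate(token_embeddings):
        state = states[index // chunk_size]
        injected.append([x + s for x, s in zip(vec, state)])
    return injected


embeddings = [[1.0, 1.0] for _ in range(5)]
gold = [[2], [2], [1]]  # hypothetical annotations for the three chunks
train_inputs = teacher_forced_inputs(embeddings, 2, gold, toy_compress)
```

Because the states come from annotations, the list comprehension that builds `states` has no sequential dependency between chunks, which is what removes the need for backpropagation through time in this simplified view.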
