SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin, Jialu Wang, Ruijie Wang

TL;DR

SPOT is a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. It also introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts.

Abstract

Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
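
To make the span-level alignment idea concrete, the following is a minimal sketch of an entropy-regularized (Sinkhorn) optimal-transport loss between a few pause-token states and the token states of one reasoning span. It is not the authors' implementation: the abstract does not specify the cost function, the marginals, or the regularization strength, so the cosine cost, the uniform marginals, `eps`, and all names (`sinkhorn_plan`, `cosine_cost`, `pause_states`, `span_states`) are assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=500):
    """Entropy-regularized OT plan with row marginals a and column marginals b."""
    kernel = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)                  # rescale rows toward a
        v = b / (kernel.T @ u)                # rescale columns toward b
    return u[:, None] * kernel * v[None, :]

def cosine_cost(x, y):
    """Pairwise cost 1 - cos(x_i, y_j); an assumed choice of cost."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T

# Hypothetical toy data: 2 pause-token states and 7 span-token states, hidden size 16.
rng = np.random.default_rng(0)
pause_states = rng.normal(size=(2, 16))
span_states = rng.normal(size=(7, 16))

cost = cosine_cost(pause_states, span_states)
a = np.full(2, 1.0 / 2)                       # assumed uniform mass over pause tokens
b = np.full(7, 1.0 / 7)                       # assumed uniform mass over span tokens
plan = sinkhorn_plan(cost, a, b)

# Soft alignment loss: transport cost under the plan (lower means better matched).
ot_loss = float((plan * cost).sum())
print(plan.shape, ot_loss)
```

Because the plan spreads each pause token's mass over every token in the span, the loss depends on the whole segment rather than only on its final token, which is the contrast with step-end alignment that the abstract draws.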

Paper Structure

This paper contains 57 sections, 18 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Representative coupling paradigms in hybrid/latent reasoning. Each gray block denotes a contiguous explicit reasoning span (a paragraph-level segment delimited by blank lines), and each orange block denotes a latent token.
  • Figure 2: Overview of the SPOT framework. Stage I trains the student on SpanDrop sequences by anchoring <pause> hidden states to the corresponding teacher span under the frozen LM head, using a Sinkhorn-based alignment loss. Stage II applies RFT by selecting correct completions and preferring shorter ones, improving stability under external <pause> insertion.
  • Figure 3: Controllability under external <pause> insertion. Bars report Pass@1 accuracy (Acc, left axis) and lines report output length (#L, right axis) when inserting one <pause> after every $N$ spans within the <think> segment.
  • Figure 4: Training-time alignment diagnostics on SpanDrop examples. Left axis: span-level Sinkhorn OT alignment loss (lower is better). Right axis: frozen-head top-$K$ coverage between tokens decoded from the <pause> hidden state and the token set of the paired teacher span (Eq. \ref{eq:topk_overlap}, $K{=}20$; higher is better). The dashed line denotes the vanilla backbone with externally inserted <pause> tokens evaluated under the same computation.
  • Figure 5: A successful <pause> compression example.
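
The frozen-head top-$K$ coverage reported in Figure 4 can be illustrated with a small sketch. The exact definition is given by the paper's equation and is not reproduced in this listing, so the version below is one plausible reading: decode the <pause> hidden state with the LM head, take the top-$K$ token ids, and report the fraction of the teacher span's distinct tokens that appear among them. The function name `topk_coverage`, the choice of denominator, and the toy identity head are assumptions for illustration only.

```python
import numpy as np

def topk_coverage(pause_hidden, lm_head_weight, span_token_ids, k=20):
    """Fraction of distinct span tokens found in the top-k tokens decoded
    from a pause hidden state (assumed definition; see lead-in)."""
    logits = lm_head_weight @ pause_hidden            # shape: (vocab_size,)
    topk_ids = np.argpartition(-logits, k - 1)[:k]    # ids of the k largest logits
    span_set = {int(t) for t in span_token_ids}
    hits = span_set.intersection(int(t) for t in topk_ids)
    return len(hits) / len(span_set)

# Hypothetical toy setup: vocabulary of 6 tokens, identity matrix as the "frozen head".
lm_head_weight = np.eye(6)
pause_hidden = np.array([5.0, 4.0, 3.0, 0.0, 0.0, 0.0])   # top-3 decoded ids: 0, 1, 2
span_token_ids = [0, 2, 5, 5]                              # distinct span tokens: 0, 2, 5

coverage = topk_coverage(pause_hidden, lm_head_weight, span_token_ids, k=3)
print(coverage)
```

In this toy case two of the three distinct span tokens (0 and 2) are among the top-3 decoded ids, so the coverage is 2/3. With a real model, `lm_head_weight` would be the frozen output projection and `k` would be 20, as in the caption.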