Internalizing LLM Reasoning via Discovery and Replay of Latent Actions

Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao

TL;DR

This paper addresses two weaknesses of existing reasoning augmentation in LLMs: the token cost of explicit chain-of-thought, and static steering vectors that cannot follow the non-stationary evolution of a reasoning trajectory. It introduces STIR (Self-Distilled Tools for Internal Reasoning), a three-stage framework that distills latent reasoning successes into a sparse library of steering primitives and uses a value-modulated, anchor-based gating controller to intervene dynamically during inference. Across six arithmetic and logical benchmarks and four models, STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding, establishing a new Pareto frontier for the accuracy-efficiency trade-off. Because its steering primitives are built from contrastive rollouts and injected in latent space, STIR decouples reasoning depth from sequence length while preserving coherence, and the resulting latent tools transfer across tasks. The work offers a practical path toward more efficient internal reasoning in deployed LLMs.

Abstract

The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings show that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, which internalizes the reasoning process to bypass explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM-Latent-Action.

Paper Structure

This paper contains 35 sections, 41 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of the STIR framework. The pipeline operates through three stages: (1) differential intrinsic action induction distills latent steering impulses by analyzing the contrastive residuals between high-reward and low-reward rollouts at critical decision points; (2) sparse control basis construction filters these raw candidates into a geometrically diverse tool library to maximize representational coverage; and (3) value-modulated trajectory intervention acts as a runtime controller that retrieves relevant steering impulses and validates them via lookahead probing before dynamically injecting them into the residual stream.
  • Figure 2: t-SNE visualization of latent state embeddings extracted from stochastic rollouts. $\blacktriangle$ denotes failure states ($\mu^-$), while $\bullet$ represents the corresponding rectified states ($\mu^+$). The directed edges illustrate the steering impulses, revealing that error states form geometrically coherent clusters that can be bridged to the high-reward manifold via specific translational vectors.
  • Figure 3: Sensitivity analysis of key control hyperparameters on AMC23 and MATH500. Subplots (a) and (b) demonstrate the inverted-U relationship between injection strength $k_{\text{scale}}$ and reasoning accuracy. Subplots (c) and (d) illustrate the impact of normalized layer depth and reveal that steering interventions are most effective within the intermediate transformer blocks.
  • Figure 4: Cross-task generalization analysis. The heatmaps depict the transfer performance when a tool library distilled from a source dataset (y-axis) is applied to a target dataset (x-axis). The left panel reports reasoning accuracy, while the right panel shows the average token count. The strong performance in off-diagonal entries indicates that STIR captures transferable latent tools that generalize across distinct tasks and domains and are robust to changes in the problem distribution.
  • Figure 5: Token usage distribution on AMC23. Comparison between STIR (green) and Vanilla (red) across four models with moderate ($k_{\text{scale}}=0.75$, top) and high ($k_{\text{scale}}=1.0$, bottom) injection strengths. The uniform leftward shift demonstrates that STIR effectively streamlines reasoning by bypassing redundant steps while simultaneously enhancing solution accuracy.
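
The three stages described in the Figure 1 caption can be illustrated with a toy sketch on plain Python lists. This is not the authors' implementation (their code is at the repository linked in the abstract). It assumes that a steering impulse is the difference between the mean high-reward state and the mean low-reward state (consistent with the $\mu^+$ and $\mu^-$ notation of Figure 2), that library diversity is enforced by a greedy cosine-similarity rule, and that lookahead probing can be stood in for by a scalar `value_fn`. The function names, the `gate` threshold, and the toy vectors are all hypothetical.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    na, nb = norm(a), norm(b)
    return dot(a, b) / (na * nb) if na and nb else 0.0

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def induce_impulse(high_reward_states, low_reward_states):
    """Stage 1 (assumed form): contrastive residual mu_plus - mu_minus.

    Returns (anchor, impulse), where the anchor is the mean failure state.
    """
    mu_plus = mean(high_reward_states)
    mu_minus = mean(low_reward_states)
    return mu_minus, sub(mu_plus, mu_minus)

def build_library(candidates, size):
    """Stage 2 (assumed rule): greedily keep the impulse that is least
    similar, by cosine, to the impulses already chosen."""
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < size:
        best = min(
            remaining,
            key=lambda c: max(cosine(c[1], s[1]) for s in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen

def intervene(hidden, library, value_fn, k_scale=0.75, gate=0.8):
    """Stage 3 (assumed form): retrieve the tool with the nearest anchor,
    skip it if the gate is closed, and inject the scaled impulse only if
    the stand-in value function prefers the steered state."""
    anchor, impulse = max(library, key=lambda t: cosine(hidden, t[0]))
    if cosine(hidden, anchor) < gate:
        return hidden
    steered = [h + k_scale * v for h, v in zip(hidden, impulse)]
    return steered if value_fn(steered) > value_fn(hidden) else hidden

if __name__ == "__main__":
    tool = induce_impulse([[1.0, 1.0], [1.0, 3.0]], [[1.0, 0.0], [3.0, 0.0]])
    # Toy value function: closer to a made-up target state is better.
    value = lambda h: -norm(sub(h, [1.0, 2.0]))
    print(intervene([2.0, 0.1], [tool], value))
```

Here `k_scale` plays the role of the injection strength studied in Figure 3. A real system would apply the update to the residual stream of a chosen transformer layer rather than to a two-dimensional list.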