Internalizing LLM Reasoning via Discovery and Replay of Latent Actions

Zhenning Shi; Yijia Zhu; Junhan Shi; Xun Zhang; Lei Wang; Congcong Miao

TL;DR

This paper targets two weaknesses of existing reasoning augmentation in LLMs: the token cost of explicit chain-of-thought, and static steering vectors that cannot track the non-stationary evolution of a reasoning trajectory. It introduces STIR, a three-stage framework that distills latent reasoning successes from contrastive rollouts into a sparse library of steering primitives, then uses a value-modulated, anchor-based gating controller to inject them dynamically during inference. Across six benchmarks and four models, STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% relative to vanilla decoding, which the authors present as a new accuracy-efficiency Pareto frontier. The distilled latent tools also transfer across tasks, and the method decouples reasoning depth from sequence length while preserving coherence. The work offers a practical, scalable route to more reliable and efficient internal reasoning in deployed LLMs.

Abstract

The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, which internalizes the reasoning process and bypasses explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM-Latent-Action.
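
As a rough illustration of the steering idea in the abstract, the sketch below computes an impulse as the difference between the mean high-reward and mean low-reward hidden states, and adds it to a hidden state only when a similarity gate fires. It is not the authors' implementation: the function names, the cosine-similarity gate, the use of the failure-state mean as the anchor, and `gate_threshold` are assumptions made for illustration, and STIR's value-modulated scoring and lookahead probing are omitted. Only the injection strength `k_scale` is a symbol taken from the paper's figure captions.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_impulse(high_reward_states, low_reward_states):
    """Difference of means: points from failure states toward success states.

    Assumption: a plain difference of means stands in for the paper's
    contrastive residual between high-reward and low-reward rollouts.
    """
    mu_pos = mean_vector(high_reward_states)
    mu_neg = mean_vector(low_reward_states)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def gated_inject(hidden, anchor, impulse, k_scale=0.75, gate_threshold=0.5):
    """Add k_scale * impulse to `hidden` only if `hidden` resembles `anchor`.

    Assumption: the gate is a cosine-similarity threshold (hypothetical).
    """
    if cosine(hidden, anchor) < gate_threshold:
        return list(hidden)  # gate closed: leave the state untouched
    return [h + k_scale * d for h, d in zip(hidden, impulse)]

# Toy 2-D example with made-up states.
high = [[1.0, 2.0], [3.0, 2.0]]
low = [[1.0, 0.0], [1.0, 0.0]]
impulse = steering_impulse(high, low)
anchor = mean_vector(low)
steered = gated_inject([2.0, 0.0], anchor, impulse)
untouched = gated_inject([0.0, 1.0], anchor, impulse)
```

In this toy setting the first state lies along the anchor direction and is shifted by the scaled impulse, while the second is orthogonal to the anchor and passes through unchanged. A real controller would operate on residual-stream activations at a chosen layer rather than on hand-written lists.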

Paper Structure (35 sections, 41 equations, 5 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Large Language Models Reasoning.
Implicit Reasoning.
Latent Representation Steering.
Preliminaries and Problem Formulation
Latent Dynamics of Generative Reasoning
The Challenge of Temporal Misalignment
Methodology
Differential Intrinsic Action Induction
Sparse Control Basis Construction
Value-Modulated Trajectory Intervention
Experiments
Experimental Setup
Datasets and Target Models.
...and 20 more sections

Figures (5)

Figure 1: Overview of the STIR framework. The pipeline operates through three stages: (1) differential intrinsic action induction distills latent steering impulses by analyzing the contrastive residuals between high-reward and low-reward rollouts at critical decision points; (2) sparse control basis construction filters these raw candidates into a geometrically diverse tool library to maximize representational coverage; and (3) value-modulated trajectory intervention acts as a runtime controller that retrieves relevant steering impulses and validates them via lookahead probing before dynamically injecting them into the residual stream.
Figure 2: t-SNE visualization of latent state embeddings extracted from stochastic rollouts. $\blacktriangle$ denotes failure states ($\mu^-$), while $\bullet$ represents the corresponding rectified states ($\mu^+$). The directed edges illustrate the steering impulses, revealing that error states form geometrically coherent clusters that can be bridged to the high-reward manifold via specific translational vectors.
Figure 3: Sensitivity analysis of key control hyperparameters on AMC23 and MATH500. Subplots (a) and (b) demonstrate the inverted-U relationship between injection strength $k_{\text{scale}}$ and reasoning accuracy. Subplots (c) and (d) illustrate the impact of normalized layer depth and reveal that steering interventions are most effective within the intermediate transformer blocks.
Figure 4: Cross-task generalization analysis. The heatmaps depict the transfer performance when a tool library distilled from a source dataset (y-axis) is applied to a target dataset (x-axis). The left panel reports reasoning accuracy, while the right panel shows the average token count. The strong performance in off-diagonal entries validates that STIR captures transferable latent tools that generalize across distinct tasks and domains, demonstrating robustness against specific problem distributions.
Figure 5: Token usage distribution on AMC23. Comparison between STIR (green) and Vanilla (red) across four models with moderate ($k_{\text{scale}}=0.75$, top) and high ($k_{\text{scale}}=1.0$, bottom) injection strengths. The uniform leftward shift demonstrates that STIR effectively streamlines reasoning by bypassing redundant steps while simultaneously enhancing solution accuracy.
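
The library-curation stage described in the Figure 1 caption (filtering raw candidates into a geometrically diverse tool library) can be illustrated with a simple greedy filter. This is a sketch under stated assumptions, not the paper's procedure: `select_diverse`, `max_tools`, and `max_similarity` are hypothetical names, and cosine similarity is assumed as the diversity measure.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_diverse(candidates, max_tools, max_similarity=0.9):
    """Greedily keep candidates that are not near-duplicates of kept ones.

    Assumption: candidates arrive in priority order, and two impulses count
    as redundant when their cosine similarity exceeds `max_similarity`.
    """
    library = []
    for vec in candidates:
        if len(library) >= max_tools:
            break
        if not any(vec):
            continue  # skip all-zero impulses, which steer nothing
        if all(cosine(vec, kept) <= max_similarity for kept in library):
            library.append(vec)
    return library

# Toy example: the second candidate is a rescaled copy of the first.
raw = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
library = select_diverse(raw, max_tools=3)
small_library = select_diverse(raw, max_tools=2)
```

The rescaled duplicate is dropped because it points in the same direction as an impulse already in the library, which is one simple way to trade library size against directional coverage.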
