Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao
TL;DR
This paper addresses two weaknesses of existing reasoning augmentation in LLMs: explicit chain-of-thought is token-inefficient, and static steering vectors cannot follow the non-stationary course of a reasoning trajectory. It introduces STIR, a three-stage framework that distills latent reasoning successes into a sparse library of steering primitives and uses a value-modulated, anchor-based gating controller to intervene dynamically during inference. Across six benchmarks and four models, STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% relative to vanilla decoding, establishing a new Pareto frontier for accuracy versus efficiency. By demonstrating transferable latent tools and leveraging contrastive rollouts, STIR decouples reasoning depth from sequence length while preserving coherence. The work offers a practical, scalable route to more reliable and efficient internal reasoning in deployed LLMs.
Abstract
The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models show that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings indicate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, which internalizes the reasoning process and bypasses explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM-Latent-Action.
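To make stages (2) and (3) of the pipeline more concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not specify the diversity criterion, the gating rule, or how the value signal is computed, so everything here is an assumption for illustration: cosine similarity as the geometric measure, greedy farthest-point selection for the library, a fixed similarity threshold for the gate, and a scalar `value` supplied by the caller. The function names `select_diverse` and `gated_steer` are hypothetical.

```python
import numpy as np

def _unit(x):
    # Normalize along the last axis; the epsilon guards against zero vectors.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def select_diverse(candidates, k):
    """Greedily pick up to k vectors with low mutual cosine similarity.

    Assumed stand-in for "sparse control basis construction".
    """
    units = _unit(candidates)
    chosen = [0]  # start from the first candidate (arbitrary choice)
    while len(chosen) < min(k, len(candidates)):
        # For each candidate, its highest similarity to anything already chosen.
        max_sim = (units @ units[chosen].T).max(axis=1)
        max_sim[chosen] = np.inf  # never re-pick a chosen vector
        chosen.append(int(np.argmin(max_sim)))
    return candidates[chosen]

def gated_steer(hidden, anchors, tools, value, threshold=0.5):
    """Inject a scaled tool vector only when `hidden` is close to an anchor.

    Assumed stand-in for "value-modulated, anchor-based gating":
    anchors[i] is the hidden-state pattern associated with tools[i],
    and `value` scales the injected impulse.
    Returns (new_hidden, index_of_tool_used_or_None).
    """
    sims = _unit(anchors) @ _unit(hidden)
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return hidden, None  # gate closed: leave the trajectory untouched
    return hidden + value * tools[best], best

# Toy usage with 2-D "hidden states".
candidates = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
library = select_diverse(candidates, k=2)

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
tools = np.array([[0.0, 2.0], [3.0, 0.0]])
steered, used = gated_steer(np.array([1.0, 0.0]), anchors, tools, value=0.5)
unchanged, skipped = gated_steer(np.array([-1.0, 0.0]), anchors, tools, value=0.5)
```

In this toy setup, a static control vector would add the same offset at every step, whereas the gate fires only when the current hidden state resembles an anchor, which is the contrast with static steering that the abstract draws.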
