Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

TL;DR

This work tackles reasoning under parameter constraints by identifying latent overthinking in fixed-depth recurrent transformers and proposing Think-at-Hard (TaH), a method that selectively deepens only hard tokens. TaH combines a neural iteration decider, duo-causal attention for cross-depth information flow, and depth-specific LoRA adapters, trained via a stable two-stage, oracle-guided scheme. Across five challenging math reasoning benchmarks, TaH achieves consistent accuracy gains with minimal parameter and computational overhead, and ablations validate the importance of architectural choices and training strategy. The approach offers a practical path to enhanced reasoning in resource-constrained LLMs, with potential for further gains through extended depths and online supervision.

Abstract

Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
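The selective iteration described in the abstract can be sketched as a toy per-token loop. This is a minimal illustration, not the authors' implementation: `selective_iterations`, `forward`, `decider`, `max_depth`, and `threshold` are hypothetical names, the "hidden state" is a single float, and attention across tokens is ignored entirely.

```python
from typing import Callable, List, Tuple


def selective_iterations(
    inputs: List[float],
    forward: Callable[[float, int], float],
    decider: Callable[[float, int], float],
    max_depth: int = 2,
    threshold: float = 0.5,
) -> Tuple[List[float], List[int]]:
    """Toy sketch: iterate deeper only where the decider flags a token as hard.

    All names and the scalar "hidden state" are assumptions for illustration.
    """
    outputs: List[float] = []
    depths: List[int] = []
    for x in inputs:
        depth = 1
        h = forward(x, depth)  # standard first forward pass
        # Keep iterating while the decider's score exceeds the threshold.
        while depth < max_depth and decider(h, depth) > threshold:
            depth += 1
            h = forward(h, depth)  # feed the hidden state back as input
        outputs.append(h)  # "verbalize" the token at this depth
        depths.append(depth)
    return outputs, depths


# Stand-in model: each pass halves the value.
# Stand-in decider: a token is "hard" if the value is still above 1.
toy_forward = lambda h, depth: h / 2.0
toy_decider = lambda h, depth: 1.0 if abs(h) > 1.0 else 0.0

outputs, depths = selective_iterations([0.5, 4.0, 1.0], toy_forward, toy_decider)
mean_depth = sum(depths) / len(depths)
```

In this toy run only the middle token receives a second iteration. By the same accounting, if 94% of tokens skip the second iteration and the depth is capped at two, the average is 1 + 0.06 = 1.06 iterations per token.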

Paper Structure

This paper contains 34 sections, 18 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Selective iteration can mitigate latent overthinking. (a) Toy example. Uniform latent iteration (always think-twice) can fix wrong predictions, but may also overthink and corrupt correct ones. (b) Next-token prediction accuracy of finetuned Qwen3-1.7B variants. Always think-twice causes more errors than corrections over direct reply. In contrast, the think-at-hard oracle, which iterates only when the first-pass prediction is wrong, achieves substantial improvements with minimal harm. While this oracle signal is unavailable in practice, it highlights the potential of selective iteration.
  • Figure 2: TaH Overview. (a) Regular causal attention: tokens attend only to previous positions. (b) Our duo-causal attention: tokens attend to both previous positions and shallower iteration depths, maintaining 2D causality. (c) Model architecture: TaH selectively iterates or verbalizes tokens. It uses LoRA at deeper iterations to shift from next-token prediction to hard-token refinement. A neural decider determines whether to continue iterating or output the token.
  • Figure 3: Training dynamics of the LLM backbone on Qwen3-0.6B-Base. TaH converges rapidly and achieves lower perplexity.
  • Figure 4: Impact of iteration strategies on Qwen3-0.6B (first 100 MATH500 samples).
  • Figure 5: Next-token prediction changes across iterations. The two tokens that most often receive a second iteration are visualized.
  • ...and 3 more figures
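The 2D causality described in the Figure 2(b) caption can be illustrated with a small mask construction. This is a sketch under stated assumptions, not the paper's code: `duo_causal_mask` is a hypothetical helper, every position is assumed to exist at every depth (in TaH only hard tokens would have deeper entries), and both comparisons are assumed to be inclusive.

```python
import numpy as np


def duo_causal_mask(seq_len: int, depth: int) -> np.ndarray:
    """Boolean mask over flattened (depth, position) pairs.

    Hypothetical helper for illustration. Entries are laid out depth-major,
    so index = d * seq_len + i. A query at (i, d) may attend to a key at
    (j, e) iff j <= i and e <= d (inclusive bounds are an assumption).
    """
    pos = np.tile(np.arange(seq_len), depth)    # position of each entry
    dep = np.repeat(np.arange(depth), seq_len)  # iteration depth of each entry
    allowed_pos = pos[None, :] <= pos[:, None]  # key position <= query position
    allowed_dep = dep[None, :] <= dep[:, None]  # key depth <= query depth
    return allowed_pos & allowed_dep


mask = duo_causal_mask(seq_len=2, depth=2)
single_depth = duo_causal_mask(seq_len=3, depth=1)
```

With a single depth, the mask reduces to the ordinary lower-triangular causal mask of Figure 2(a). With more depths, a deeper query additionally sees shallower entries at the same or earlier positions.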