Hierarchical Reasoning Models: Perspectives and Misconceptions

Renee Ge, Qianli Liao, Tomaso Poggio

TL;DR

The paper analyzes the Hierarchical Reasoning Model (HRM) as a latent-space, recurrent approach to improving logical reasoning in transformers. It interrogates key design choices (the L/H modules, one-step gradient training, and Adaptive Computation Time) via ablations on a Sudoku task. The findings show that the high-level H-module adds little beyond a strong L-module, that HRM's training aligns with diffusion-like latent consistency, and that ACT halting does not enhance inference when the maximum number of steps is used. Collectively, these results challenge the necessity of hierarchical recurrence and encourage further exploration of latent-consistency and adaptive computation methods for reasoning tasks.

Abstract

Transformers have demonstrated remarkable performance in natural language processing and related domains, as they largely focus on sequential, autoregressive next-token prediction tasks. Yet they struggle with logical reasoning, not necessarily because of a fundamental limitation of these models, but possibly because more creative uses, such as latent-space and recurrent reasoning, remain underexplored. An emerging exploration in this direction is the Hierarchical Reasoning Model (Wang et al., 2025), which introduces a novel type of recurrent reasoning in the latent space of transformers, achieving remarkable performance on a wide range of 2D reasoning tasks. Despite the promising results, this line of models is still at an early stage and calls for in-depth investigation. In this work, we review this class of models, examine key design choices, test alternative variants, and clarify common misconceptions.

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: Comparison between the original HRM (4-layer H-module and 4-layer L-module) and an HRM with an 8-layer L-module only (L-cycle=1). The latter is equivalent to an 8-layer plain transformer. The x-axis is the number of training steps/iterations (i.e., number of minibatches). The plain transformer/L-module-only setting performs similarly to or slightly better than the original HRM while running much faster (runtime 1h 48m vs. the original HRM's 4h 21m on an A100 GPU).
  • Figure 2: Reasoning with the same number of inference steps for all examples on the Sudoku task. The x-axis represents the number of reasoning steps, which is kept the same across all examples. We then plot the performance curves for token accuracy and exact accuracy, where exact accuracy for each example is 1 for a fully correct sequence and 0 otherwise. The original HRM is used as the model architecture. The observation that using the same number of reasoning steps on all examples can help raises a question about the recurrent nature of the HRM paradigm: is it truly recurrent, or does it effectively function as a very deep feedforward model?
  • Figure 3: Sample-specific reasoning in inference with different strategies. In contrast to Figure 2, in this setting every example has a different number of reasoning steps, determined by the halting logits ($\hat{Q}_{halt}$ and $\hat{Q}_{continue}$) generated by the model. The model decides to halt using one of two strategies: (1) when $\mathrm{sigmoid}(\hat{Q}_{halt}) > \mathrm{threshold}$; (2) when $\mathrm{sigmoid}(\hat{Q}_{halt} - \hat{Q}_{continue}) > \mathrm{threshold}$. The x-axes of the 3 subfigures represent the threshold. The y-axes represent token accuracy, exact accuracy, and average halting steps, respectively. The original HRM is used as the model architecture.
  • Figure 4: Example sample performances with different reasoning steps. Token accuracy (left column), exact accuracy (middle column), and output norm (right column). These figures show the performance of a selected number of examples across different numbers of reasoning steps. The second row shows the statistics from the 10 examples with the best performance. The third row shows the statistics from 10 random examples.
  • Figure 5: Some concrete examples of diffusion-like reasoning with very few steps. This is in sharp contrast to how humans reason, posing an interesting question on the nature of reasoning.
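
The two halting rules in the Figure 3 caption and the two accuracy metrics in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the function names (`should_halt`, `halting_step`, `token_accuracy`, `exact_accuracy`), the strategy labels, and the example logits are all hypothetical, and the sketch assumes the halting logits are available as one `(q_halt, q_continue)` pair per reasoning step.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def should_halt(q_halt: float, q_continue: float, threshold: float,
                strategy: str = "halt_only") -> bool:
    """Apply one of the two halting rules described in the Figure 3 caption.

    The strategy names are illustrative labels, not identifiers from the paper.
    """
    if strategy == "halt_only":
        # Strategy 1: sigmoid(Q_halt) > threshold
        score = sigmoid(q_halt)
    elif strategy == "halt_minus_continue":
        # Strategy 2: sigmoid(Q_halt - Q_continue) > threshold
        score = sigmoid(q_halt - q_continue)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return score > threshold


def halting_step(logits_per_step, threshold: float, strategy: str,
                 max_steps: int) -> int:
    """Return the 1-indexed step at which the chosen rule first fires.

    Assumption: if the rule never fires within max_steps, the example
    is treated as running for max_steps.
    """
    for step, (q_halt, q_continue) in enumerate(logits_per_step[:max_steps], start=1):
        if should_halt(q_halt, q_continue, threshold, strategy):
            return step
    return max_steps


def token_accuracy(pred, target) -> float:
    """Fraction of positions where the prediction matches the target."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def exact_accuracy(pred, target) -> float:
    """1.0 for a fully correct sequence, 0.0 otherwise."""
    return 1.0 if list(pred) == list(target) else 0.0


# Made-up halting logits for a single example over three reasoning steps.
example_logits = [(-2.0, 1.0), (0.5, 1.0), (3.0, 0.0)]
step_rule_1 = halting_step(example_logits, 0.5, "halt_only", max_steps=16)
step_rule_2 = halting_step(example_logits, 0.5, "halt_minus_continue", max_steps=16)
```

With these made-up logits the two rules stop at different steps for the same example, which is why the caption reports average halting steps separately for each strategy as the threshold varies.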