Exploring the Hierarchical Reasoning Model for Small Natural-Image Classification Without Augmentation

Alexander V. Mantzaris

TL;DR

The paper evaluates the Hierarchical Reasoning Model (HRM) as a practical image classifier under a deliberately no-augmentation regime, using two Transformer-style modules, a DEQ-based one-step gradient, deep supervision, and modern normalization/positional techniques. It compares HRM to a conventional CNN baseline on MNIST, CIFAR-10, and CIFAR-100, revealing strong MNIST performance but substantial generalization gaps on CIFAR-10/100 due to insufficient image-specific inductive bias. The results indicate that HRM can train stably with small parameter budgets, but without augmentation or additional inductive structure it underperforms simple convolutional architectures on small natural images. The work highlights potential directions for improving HRM, such as architectural tweaks to bolster image priors and regularization in the no-augmentation setting, to realize its theoretical advantages in practical classification tasks.
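To make the "generalization gap" concrete, the short sketch below recomputes it from the train and test accuracies reported in the abstract. The dictionary layout and the helper name `generalization_gap` are illustrative assumptions, not part of the paper.

```python
# Accuracies (in percent) as reported in the abstract for CIFAR-100.
# The data layout and function name are hypothetical, chosen for illustration.
reported = {
    ("CIFAR-100", "HRM"): {"train": 91.5, "test": 29.7},
    ("CIFAR-100", "CNN"): {"train": 50.5, "test": 45.3},
}


def generalization_gap(train_acc: float, test_acc: float) -> float:
    """Train-minus-test accuracy in percentage points, rounded to one decimal."""
    return round(train_acc - test_acc, 1)


gaps = {
    key: generalization_gap(acc["train"], acc["test"])
    for key, acc in reported.items()
}

for (dataset, model), gap in gaps.items():
    print(f"{dataset} {model}: gap = {gap} points")
```

On these numbers the HRM gap is roughly twelve times the CNN gap, which matches the overfitting described in the text.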

Abstract

This paper asks whether the Hierarchical Reasoning Model (HRM), with its two Transformer-style modules $(f_L,f_H)$, one-step (DEQ-style) training, deep supervision, Rotary Position Embeddings, and RMSNorm, can serve as a practical image classifier. It is evaluated on MNIST, CIFAR-10, and CIFAR-100 under a deliberately raw regime: no data augmentation, the same optimizer family for every model with a one-epoch warmup followed by cosine-floor decay, and label smoothing. HRM optimizes stably and performs well on MNIST ($\approx 98\%$ test accuracy), but on small natural images it overfits and generalizes poorly: on CIFAR-10, HRM reaches 65.0\% after 25 epochs, whereas a two-stage Conv--BN--ReLU baseline attains 77.2\% while training $\sim 30\times$ faster per epoch; on CIFAR-100, HRM achieves only 29.7\% test accuracy despite 91.5\% train accuracy, while the same CNN reaches 45.3\% test accuracy with 50.5\% train accuracy. Loss traces and error analyses indicate healthy optimization but insufficient image-specific inductive bias for HRM in this regime. The conclusion is that, for small-resolution image classification without augmentation, HRM as it currently exists is not competitive with even simple convolutional architectures, although this does not rule out that modifications to the model could improve it substantially.
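The training regime named above (one-epoch warmup, cosine decay to a floor, label smoothing) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear warmup shape, `base_lr`, `floor_ratio`, the smoothing strength `eps`, and both function names are hypothetical, since the abstract gives no values for them.

```python
import math


def lr_at_step(step, steps_per_epoch, total_epochs, base_lr=1e-3, floor_ratio=0.1):
    """Learning rate with a one-epoch warmup, then cosine decay to a floor.

    Assumptions (not from the paper): the warmup is linear, and the floor
    is floor_ratio * base_lr.
    """
    warmup_steps = steps_per_epoch
    total_steps = steps_per_epoch * total_epochs
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = base_lr * floor_ratio
    return floor + (base_lr - floor) * cosine


def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    The target puts (1 - eps) on the true class and spreads eps uniformly
    over all classes; eps=0.1 is an assumed value.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    k = len(logits)
    return -sum(
        ((1.0 - eps) * (i == target) + eps / k) * lp
        for i, lp in enumerate(log_probs)
    )


# Example: 100 steps per epoch, 25 epochs (25 matches the CIFAR-10 run length).
schedule = [lr_at_step(s, 100, 25) for s in range(100 * 25)]
```

The floor keeps the learning rate from decaying to zero late in training, and label smoothing is the only explicit regularizer in an otherwise augmentation-free setup.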

Paper Structure

This paper contains 10 sections, 13 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Training loss across optimization steps. The lightly shaded trace is the instantaneous per-segment step loss; the darker trace is a moving average for readability. Loss stabilizes after the initial descent, aligning with the steady gains in accuracy.
  • Figure 2: Misclassified MNIST examples. Each tile shows the true class $T$ and the predicted class $P$. Confusions concentrate on digit pairs with similar topology (for example, curled tails in $9$ vs. closed loops in $0$; angled $7$ vs. vertical $1$).
  • Figure 3: HRM on CIFAR-10: training loss per step (light) with a moving-average smoothed trace (dark). The curve decreases smoothly without instabilities.
  • Figure 4: HRM on CIFAR-10: examples misclassified by the final model. Each tile shows the ground-truth (T) and prediction (P) indices.
  • Figure 5: CNN on CIFAR-10: training loss per step (light) with a moving-average smoothed trace.
  • ...and 5 more figures
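The loss-curve captions above describe a raw per-step trace overlaid with a moving-average trace. A minimal sketch of such smoothing follows; the trailing-window form, the window size, and the function name are assumptions, since the captions do not say which variant of moving average was used.

```python
def moving_average(values, window=50):
    """Trailing moving average of a loss trace.

    Entries before a full window is available are averaged over the
    samples seen so far. The window length is an assumed parameter.
    """
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the sample that just left the trailing window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Toy per-step losses; real traces would come from the training loop.
raw_losses = [1.0, 2.0, 3.0, 4.0]
smooth_losses = moving_average(raw_losses, window=2)
```

Plotting `raw_losses` in a light color and `smooth_losses` in a darker one reproduces the two-trace style the captions describe.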