Exploring the Hierarchical Reasoning Model for Small Natural-Image Classification Without Augmentation
Alexander V. Mantzaris
TL;DR
The paper evaluates the Hierarchical Reasoning Model (HRM) as a practical image classifier under a deliberately no-augmentation regime, using two Transformer-style modules, a DEQ-based one-step gradient, deep supervision, and modern normalization/positional techniques. It compares HRM to a conventional CNN baseline on MNIST, CIFAR-10, and CIFAR-100, revealing strong MNIST performance but substantial generalization gaps on CIFAR-10/100 due to insufficient image-specific inductive bias. The results indicate that HRM can train stably with small parameter budgets, but without augmentation or additional inductive structure it underperforms simple convolutional architectures on small natural images. The work highlights potential directions for improving HRM, such as architectural tweaks to bolster image priors and regularization in the no-augmentation setting, to realize its theoretical advantages in practical classification tasks.
Abstract
This paper asks whether the Hierarchical Reasoning Model (HRM), with its two Transformer-style modules $(f_L,f_H)$, one-step (DEQ-style) training, deep supervision, Rotary Position Embeddings, and RMSNorm, can serve as a practical image classifier. It is evaluated on MNIST, CIFAR-10, and CIFAR-100 under a deliberately raw regime: no data augmentation, identical optimizer family with one-epoch warmup then cosine-floor decay, and label smoothing. HRM optimizes stably and performs well on MNIST ($\approx 98\%$ test accuracy), but on small natural images it overfits and generalizes poorly. On CIFAR-10, HRM reaches 65.0\% test accuracy after 25 epochs, whereas a two-stage Conv--BN--ReLU baseline attains 77.2\% while training $\sim 30\times$ faster per epoch. On CIFAR-100, HRM achieves only 29.7\% test accuracy despite 91.5\% train accuracy, while the same CNN reaches 45.3\% test accuracy with 50.5\% train accuracy. Loss traces and error analyses indicate healthy optimization but insufficient image-specific inductive bias for HRM in this regime. It is concluded that, for small-resolution image classification without augmentation, HRM as it currently exists is not competitive with even simple convolutional architectures, although this does not rule out that modifications to the model could improve it substantially.
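To make the training schedule named in the abstract concrete, the following is a minimal sketch of a one-epoch linear warmup followed by cosine decay to a nonzero floor. It is an illustration under stated assumptions, not the paper's implementation: the function name `lr_at_step`, the linear shape of the warmup, and the default values of `base_lr` and `floor_frac` are all hypothetical and do not come from the source.

```python
import math


def lr_at_step(step, steps_per_epoch, total_epochs, base_lr=1e-3, floor_frac=0.1):
    """Learning rate at a given optimizer step (0-indexed).

    Assumed schedule: linear warmup over the first epoch, then cosine decay
    from base_lr down to floor_frac * base_lr at the final step.
    base_lr and floor_frac are placeholder values, not taken from the paper.
    """
    warmup_steps = steps_per_epoch              # one-epoch warmup
    total_steps = steps_per_epoch * total_epochs

    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps

    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)

    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))

    floor_lr = base_lr * floor_frac
    return floor_lr + (base_lr - floor_lr) * cosine


if __name__ == "__main__":
    # Example: 25 epochs with a hypothetical 100 steps per epoch.
    for s in (0, 99, 100, 1300, 2500):
        print(s, lr_at_step(s, steps_per_epoch=100, total_epochs=25))
```

The floor keeps the learning rate from reaching zero late in training, so the final epochs still make small updates; the exact floor used in the experiments is not stated in the abstract.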
