Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis
Badri N. Patro, Suhas Ranganath, Vinay P. Namboodiri, Vijay S. Agneeswaran
TL;DR
Heracles addresses the high cost and instability of modeling high-resolution data by combining a global Hartley-kernel state-space model with a parallel local convolutional SSM, followed by attention in deeper layers. The design uses $O(N \log N)$ spectral computation for global interactions and convolution to preserve local detail, so it retains transformer-like capacity without applying full $O(N^2)$ attention throughout the network. Empirically, it achieves state-of-the-art top-1 accuracy on ImageNet-1K across its Small-to-Huge variants, strong transfer performance on CIFAR-10/100 and other datasets, and leading results on COCO instance segmentation. It also reaches state-of-the-art results on seven time-series benchmarks, indicating that it generalizes across domains with spectral data. The results suggest that real-valued spectral transformations within SSMs can combine local and global representations, offering a versatile and energy-efficient alternative to purely attention-based architectures.
Abstract
Transformers have revolutionized image modeling tasks with adaptations like DeIT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face challenges with inductive bias and quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles leverages a Hartley kernel-based state space model for global image information, a localized convolutional network for local details, and attention mechanisms in deeper layers for token interactions. Our extensive experiments demonstrate that Heracles-C-small achieves state-of-the-art performance on the ImageNet dataset with 84.5\% top-1 accuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to 85.9\% and 86.4\%, respectively. Additionally, Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles also proves its versatility by achieving state-of-the-art results on seven time-series datasets, showcasing its ability to generalize across domains with spectral data, capturing both local and global information. The project page is available at \url{https://github.com/badripatro/heracles}.
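To make the "real-valued spectral transformation" concrete, the following is a minimal NumPy sketch of the discrete Hartley transform (DHT), the real-valued counterpart of the Fourier transform that a Hartley kernel builds on. It is an illustration only, not the authors' implementation: the function names `dht`, `idht`, `dht_direct`, and `global_mix`, and the per-frequency `gate` vector, are hypothetical and introduced here for exposition. The sketch relies on the standard identity that the DHT equals the real part minus the imaginary part of the FFT, which is what gives the $O(N \log N)$ cost.

```python
import numpy as np


def dht(x, axis=-1):
    """Discrete Hartley transform along `axis`, computed from the FFT.

    H[k] = sum_n x[n] * (cos(2*pi*n*k/N) + sin(2*pi*n*k/N)) = Re(X[k]) - Im(X[k]),
    where X is the FFT of a real-valued input x.
    """
    X = np.fft.fft(x, axis=axis)
    return X.real - X.imag


def idht(h, axis=-1):
    """Inverse DHT: the transform is its own inverse up to a factor of 1/N."""
    n = h.shape[axis]
    return dht(h, axis=axis) / n


def dht_direct(x):
    """O(N^2) reference that applies the cas kernel directly (for checking only)."""
    n = x.shape[-1]
    k = np.arange(n)
    angles = 2.0 * np.pi * np.outer(k, k) / n
    cas = np.cos(angles) + np.sin(angles)  # symmetric N x N Hartley kernel
    return x @ cas


def global_mix(x, gate):
    """Hypothetical global token mixing: scale each Hartley frequency by a
    real-valued weight, then transform back. `gate` stands in for learned
    parameters and is an assumption of this sketch."""
    return idht(dht(x) * gate)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((4, 16))  # 4 channels, 16 tokens each
    spectrum = dht(tokens)
    print(np.allclose(spectrum, dht_direct(tokens)))  # FFT route matches the definition
    print(np.allclose(idht(spectrum), tokens))        # round trip recovers the input
```

Because the Hartley spectrum is real, any weights applied in the spectral domain can also be real, avoiding the complex arithmetic needed when gating Fourier coefficients. Each output token of `global_mix` depends on every input token, which is the sense in which such a layer captures global information.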
