Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis

Badri N. Patro, Suhas Ranganath, Vinay P. Namboodiri, Vijay S. Agneeswaran

TL;DR

Heracles tackles the challenge of high-cost, high-variance modeling for high-resolution data by merging a global Hartley-kernel state-space model with a parallel local convolutional SSM, followed by attention in deeper layers. This design achieves $O(N \log N)$-type efficiency for global interactions while preserving local detail through convolution, and it retains transformer-like capacity in the deeper layers without paying the full $O(N^2)$ attention cost throughout the network. Empirically, it achieves state-of-the-art top-1 accuracy on ImageNet-1K across Small-to-Huge variants, strong transfer performance on CIFAR-10/100 and other datasets, and leading results on COCO instance segmentation; it also reports state-of-the-art results on seven time-series benchmarks, indicating cross-domain generalization to spectral data. The results underscore the potential of real-valued spectral transforms within SSMs to combine local and global representations, offering a versatile and efficient alternative to purely attention-based architectures.
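
As a hedged aside on where the $O(N \log N)$ figure comes from, the textbook discrete Hartley transform of a real length-$N$ sequence is written below. This is the standard definition, not necessarily the exact formulation used in the paper, and the symbols $x$, $H$, and $F$ are introduced here only for illustration.

$$
H[k] = \sum_{n=0}^{N-1} x[n]\,\mathrm{cas}\!\left(\frac{2\pi n k}{N}\right), \qquad \mathrm{cas}(\theta) = \cos\theta + \sin\theta .
$$

If $F[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i n k / N}$ is the DFT of the same real sequence, then

$$
H[k] = \mathrm{Re}\,F[k] - \mathrm{Im}\,F[k],
$$

so all $N$ real-valued coefficients follow from a single FFT in $O(N \log N)$ operations, whereas pairwise self-attention over $N$ tokens needs $O(N^2)$ score computations.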

Abstract

Transformers have revolutionized image modeling tasks with adaptations like DeIT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face challenges with inductive bias and high quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles leverages a Hartley kernel-based state space model for global image information, a localized convolutional network for local details, and attention mechanisms in deeper layers for token interactions. Our extensive experiments demonstrate that Heracles-C-small achieves state-of-the-art performance on the ImageNet dataset with 84.5% top-1 accuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to 85.9% and 86.4%, respectively. Additionally, Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles also proves its versatility by achieving state-of-the-art results on seven time-series datasets, showcasing its ability to generalize across domains with spectral data, capturing both local and global information. The project page is available at https://github.com/badripatro/heracles
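
The global/local split described in the abstract can be illustrated with a minimal 1D sketch in pure Python. It is not the authors' implementation: the function names (`dht`, `global_branch`, `local_branch`, `hybrid_layer`), the 3-tap local filter, and the choice to sum the two branches are all assumptions made for illustration, and the attention stage of the deeper layers is omitted.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dht(x):
    """Discrete Hartley transform of a real sequence: Re(FFT) - Im(FFT)."""
    return [f.real - f.imag for f in fft(x)]


def idht(h):
    """Inverse DHT: the transform is its own inverse up to a 1/N factor."""
    return [v / len(h) for v in dht(h)]


def global_branch(x, kernel):
    """Global mixing (assumed form): scale each Hartley coefficient by a weight."""
    return idht([c * w for c, w in zip(dht(x), kernel)])


def local_branch(x, taps):
    """Local mixing (assumed form): zero-padded 3-tap cross-correlation."""
    padded = [0.0] + list(x) + [0.0]
    return [sum(t * padded[i + j] for j, t in enumerate(taps))
            for i in range(len(x))]


def hybrid_layer(x, kernel, taps):
    """Combine both branches by summation (the combination rule is an assumption)."""
    g = global_branch(x, kernel)
    l = local_branch(x, taps)
    return [a + b for a, b in zip(g, l)]


# Hypothetical usage: an 8-sample signal, an all-pass spectral kernel,
# and a smoothing local filter. In a trained model both would be learned.
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
output = hybrid_layer(signal, kernel=[1.0] * 8, taps=[0.25, 0.5, 0.25])
```

Because every Hartley coefficient is real, the spectral weights in `kernel` can be real as well, which is the property the abstract points to when it contrasts real-valued transforms with complex Fourier-based ones.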

Paper Structure (29 sections, 13 equations, 5 figures, 16 tables)

This paper contains 29 sections, 13 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Heracles uses a global SSM, built on the Hartley transform and a learnable kernel, to capture global features, coupled with a local SSM that uses a ConvNet for precise localization.
  • Figure 2: Architectural details of the Heracles model, with an SSM layer and a Transformer layer. The SSM layer comprises a global SSM that uses the Hartley transform to capture global information and a local SSM that uses a convolution operator to capture local information. Subsequently, it uses multi-headed attention for message communication between tokens. 'A' stands for Alternative, 'R' stands for Reverse.
  • Figure 3: Filter characterization of the initial four layers of the GFNet (Rao et al., 2021) and Heracles-C models. It shows that most of the information in Heracles-C is concentrated in the low-frequency regions of an image.
  • Figure 4: Comparison of the spectra of real transforms, such as the cosine and Hartley transforms, with the complex Fourier transform. We show the energy compaction and concentration properties of the real transforms relative to the complex transform of an image. This shows that Heracles is amenable to model compression.
  • Figure 5: Comparison of ImageNet Top-1 Accuracy (%) vs Parameters (M) and Accuracy (%) vs GFLOPs for various models with hierarchical architectures.