Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale

Jerome Ku, Eric Nguyen, David W. Romero, Garyk Brixi, Brandon Yang, Anton Vorontsov, Ali Taghibakhshi, Amy X. Lu, Dave P. Burke, Greg Brockman, Stefano Massaroli, Christopher Ré, Patrick D. Hsu, Brian L. Hie, Stefano Ermon, Michael Poli

TL;DR

The work targets scalable language modeling by moving beyond Transformers to convolutional multi-hybrids, which combine input-dependent convolutions with complementary operators such as attention. It introduces StripedHyena 2, a convolutional multi-hybrid architecture built from three operators, Hyena-SE (short explicit filters), Hyena-MR (medium regularized filters), and Hyena-LI (long implicit filters), and optimized with hardware-aware blocked kernels and two context parallelism (CP) strategies, all-to-all and point-to-point. The paper presents a two-stage blocked convolution technique, wipe-clean block layouts, and grouped filter sharing to maximize tensor-core throughput. At the 40 billion parameter scale, end-to-end training is 1.2 to 2.9 times faster than optimized Transformers and 1.1 to 1.4 times faster than previous-generation hybrids, and the approach supports long-context modeling up to $1{,}000{,}000$ tokens. Evo 2 demonstrates the practical efficacy of the approach on byte-tokenized genomic data, with 40B-parameter models trained on trillions of tokens. The contributions include architecture design, kernel-level implementations, context-parallel algorithms, scaling results, and open-source tooling (Savanna) to support research on convolutional multi-hybrids.

Abstract

We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression, with input-dependent convolutions and attention offering complementary performance. Second, co-designing convolution operators and hardware-aware algorithms enables efficiency gains in regimes where previous alternative architectures struggle to surpass Transformers. At the 40 billion parameter scale, we train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids. On H100 GPUs and model width 4096, individual operators in the proposed multi-hybrid StripedHyena 2 architecture achieve two-fold throughput improvement over linear attention and state-space models. Multi-hybrids excel at sequence modeling over byte-tokenized data, as demonstrated by the Evo 2 line of models. We discuss the foundations that enable these results, including architecture design, overlap-add blocked kernels for tensor cores, and dedicated all-to-all and point-to-point context parallelism strategies.
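The blocked-kernel idea mentioned above can be illustrated with a small sketch. It is not the paper's kernel: it assumes the filter length does not exceed the block size and that the sequence length is a multiple of the block size, so that each output block depends only on the current and the previous input block. Under those assumptions the causal Toeplitz matrix has just two distinct nonzero blocks, here called `h0` and `h1`, and the convolution reduces to two small matrix products per block. All function names are hypothetical, and the pure-Python `matvec` stands in for the dense matrix multiplications a tensor-core kernel would issue.

```python
def causal_conv_direct(x, h):
    """Reference causal FIR convolution: y[t] = sum_k h[k] * x[t - k]."""
    return [
        sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
        for t in range(len(x))
    ]


def toeplitz_blocks(h, block):
    """Build the two nonzero blocks of the causal Toeplitz matrix.

    Assumes len(h) <= block, so only the diagonal block (h0) and the
    first sub-diagonal block (h1) contain filter taps.
    """
    def tap(d):
        return h[d] if 0 <= d < len(h) else 0.0

    h0 = [[tap(i - j) for j in range(block)] for i in range(block)]
    h1 = [[tap(i - j + block) for j in range(block)] for i in range(block)]
    return h0, h1


def matvec(m, v):
    """Dense matrix-vector product (stand-in for a tensor-core GEMM)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def causal_conv_blocked(x, h, block):
    """Blocked causal convolution: y_n = h0 @ x_n + h1 @ x_{n-1}."""
    assert len(h) <= block, "sketch assumes filter length <= block size"
    assert len(x) % block == 0, "sketch assumes length divisible by block"
    h0, h1 = toeplitz_blocks(h, block)
    prev = [0.0] * block  # nothing precedes the first block
    y = []
    for start in range(0, len(x), block):
        cur = x[start:start + block]
        same_block = matvec(h0, cur)    # taps that stay inside the block
        spill_over = matvec(h1, prev)   # taps reaching into the previous block
        y.extend(p + q for p, q in zip(same_block, spill_over))
        prev = cur
    return y
```

Calling `causal_conv_blocked(x, h, block)` on any input that satisfies the two assumptions should match `causal_conv_direct(x, h)`; the blocked form trades a sliding-window loop for fixed-size dense products, which is the shape of work that tensor cores handle well.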

Paper Structure

This paper contains 55 sections, 24 equations, 16 figures, 5 tables, 1 algorithm.

Figures (16)

  • Figure 2.1: Overview of the convolutional operators forming the basis of StripedHyena 2: Hyena-SE (short explicit filters), Hyena-MR (medium regularized filters), Hyena-LI (long implicit filters). All operators use the Hyena structure [poli2023hyena], tailoring the inner convolution parametrization for an improved balance of quality and efficiency. Given these operators, we explore different striped layouts.
  • Figure 2.2: End-to-end iteration times (forward and backward) during training, collected on a large cluster of H100 SXM GPUs. See Table \ref{tab:appendix_scaling} for details on the measurement protocol.
  • Figure 3.1: Forward latency and TFLOPS / second of Hyena-MR variants with filter length $128$. We compare a baseline implementation using PyTorch convolutions and our two-stage blocked kernel, showing substantial improvements in latency and throughput.
  • Figure 3.2: Forward latency and TFLOPS / second of Hyena-SE, Hyena-MR and other common operators in architecture design: multi-head attention (MHA) and linear attention variants. All values are collected at operator width $4096$ (corresponding to model width at $7$B parameters), on H100s. For MHA, we report both a highly optimized implementation for Hopper GPUs (PyTorch SDPA) as well as a previous generation implementation not optimized for Hopper GPUs (FlashAttention2) [dao2023flashattention]. All other operators use their official auto-tuned Triton kernels. Convolutional primitives remain efficient across sequence lengths, with substantially higher throughput than other operators, including efficient alternatives to MHA.
  • Figure 4.1: Diagram of computation and communication in all-to-all convolutions. This context parallelism strategy can be used in both inner hyena convolutions (corresponding to multiplication with $G_{tt'}$, Eq. \ref{eq:hyena_structure}) or featurizer convolutions ($T_{tt'}$, $H_{tt'}$, $K_{tt'}$). Filters are stored or computed in each context parallel rank to avoid communication overheads. The convolution inside the context parallel region can be computed with any algorithm, e.g., FFT-based or direct.
  • ...and 11 more figures
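
The all-to-all strategy described in the Figure 4.1 caption can be sketched as a single-process simulation. This is an illustration under stated assumptions, not the paper's implementation: ranks are modeled as entries of a Python list, the convolution is assumed to be depthwise (one causal filter per channel), and both the sequence length and the channel count are assumed to divide evenly by the number of ranks. The first exchange turns sequence shards into channel shards so that each rank sees the full sequence for its channels, the convolution runs locally with any algorithm (a direct one here), and a second exchange restores the sequence sharding. All function names are hypothetical.

```python
def seq_to_channel_shards(shards):
    """Simulated all-to-all: shards[r][t][c] holds a sequence chunk with all
    channels; the result holds the full sequence with channel slice r."""
    n = len(shards)
    dc = len(shards[0][0]) // n  # channels per rank
    out = []
    for r in range(n):
        rows = []
        for src in range(n):  # rank order equals sequence order
            for row in shards[src]:
                rows.append(row[r * dc:(r + 1) * dc])
        out.append(rows)
    return out


def channel_to_seq_shards(shards):
    """Simulated all-to-all in the opposite direction."""
    n = len(shards)
    lc = len(shards[0]) // n  # time steps per rank
    out = []
    for r in range(n):
        rows = []
        for t in range(r * lc, (r + 1) * lc):
            row = []
            for src in range(n):  # rank order equals channel order
                row.extend(shards[src][t])
            rows.append(row)
        out.append(rows)
    return out


def depthwise_causal_conv(x, filters):
    """Direct depthwise causal convolution.

    x[t][c] is the input and filters[c][k] the k-th tap of channel c:
    y[t][c] = sum_k filters[c][k] * x[t - k][c].
    """
    return [
        [
            sum(f[k] * x[t - k][c] for k in range(len(f)) if t - k >= 0)
            for c, f in enumerate(filters)
        ]
        for t in range(len(x))
    ]


def a2a_conv(seq_shards, filters):
    """All-to-all context-parallel convolution over simulated ranks."""
    n = len(seq_shards)
    dc = len(filters) // n
    ch_shards = seq_to_channel_shards(seq_shards)
    # Each rank reads its channel slice from a locally held copy of the
    # filters, so no filter communication is needed.
    ch_out = [
        depthwise_causal_conv(ch_shards[r], filters[r * dc:(r + 1) * dc])
        for r in range(n)
    ]
    return channel_to_seq_shards(ch_out)
```

Concatenating the shards returned by `a2a_conv` in rank order should reproduce `depthwise_causal_conv` applied to the unsharded input. The resharding works because a depthwise convolution never mixes channels, so once a rank holds the entire sequence for its channel slice, no further communication is needed until the output is returned to sequence-sharded form.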