Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

Mohammad Mahdi Moradi, Walid Ahmed, Shuangyue Wen, Sudhir Mudur, Weiwei Zhang, Yang Liu

TL;DR

FlowHN tackles the quadratic cost of self-attention by pairing Attention and State-Space Model (SSM) branches in a parallel hybrid architecture. It introduces FLOP-aware token splitting and a fusion/projection module to balance compute between the branches and enhance representation expressivity, enabling more efficient token processing. Across 135M, 350M, and 1B parameter autoregressive models, FlowHN achieves up to 4× higher Tokens per Second (TPS) and up to 2× better Model FLOPs Utilization (MFU) than existing hybrids, while maintaining or improving accuracy. This approach promises practical efficiency gains for large-scale language models and motivates further exploration of dynamic routing and scalability beyond autoregressive tasks.

Abstract

Attention and State-Space Models (SSMs) provide complementary strengths when combined in a hybrid network, either in sequence or in parallel. A sequential hybrid pipeline alternates between applying a transformer to the input and feeding its output into an SSM. This leaves the individual components idle for periods of time, which increases end-to-end latency and caps throughput. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with the output of one pair forming the input to the next. Two issues arise: (i) creating an expressive knowledge representation from the inherently divergent outputs of the separate branches, and (ii) balancing the computation load between the parallel branches while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. FlowHN has two differentiating factors: a FLOP-aware dynamic token split between the attention and SSM branches, which balances the compute load efficiently, and a method to fuse the highly divergent outputs of the individual branches to enhance representation expressivity. Together they enable much faster token processing, avoid bottlenecks, and at the same time yield significantly improved accuracy compared to competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4× higher Tokens per Second (TPS) and 2× better Model FLOPs Utilization (MFU).
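
To make the idea of a FLOP-aware token split concrete, the following is a minimal sketch, not the authors' algorithm. It assumes two rough per-branch cost models (quadratic in token count for attention, linear for the SSM) and picks the split that brings the two branch costs closest together. The function names `attention_flops`, `ssm_flops`, and `balanced_split`, as well as the constants `state=16` and `expand=2`, are hypothetical choices made for illustration.

```python
from typing import Tuple


def attention_flops(k: int, d: int) -> int:
    """Rough FLOPs for self-attention over k tokens of width d (assumed cost model)."""
    # Q, K, V, and output projections: four (k x d) by (d x d) matmuls, 2 FLOPs per MAC.
    proj = 8 * k * d * d
    # QK^T scores plus the attention-weighted sum over V: two matmuls of 2*k*k*d FLOPs each.
    mix = 4 * k * k * d
    return proj + mix


def ssm_flops(m: int, d: int, state: int = 16, expand: int = 2) -> int:
    """Rough FLOPs for an SSM block over m tokens (assumed cost model, linear in m)."""
    inner = expand * d
    # Input projection (d -> 2*inner) and output projection (inner -> d).
    proj = 6 * m * d * inner
    # Assumed constant factor for the per-token recurrent state update.
    scan = 6 * m * inner * state
    return proj + scan


def balanced_split(n: int, d: int) -> Tuple[int, int]:
    """Return (tokens to attention, tokens to SSM) minimizing the FLOP gap between branches."""
    k = min(
        range(n + 1),
        key=lambda k: abs(attention_flops(k, d) - ssm_flops(n - k, d)),
    )
    return k, n - k


k, m = balanced_split(2048, 1024)
print(f"attention tokens: {k}, SSM tokens: {m}")
```

Under these assumed cost models, the share of tokens routed to attention shrinks as the sequence grows, because the quadratic term makes each additional attention token more expensive. Which tokens go to which branch, and how the branch outputs are fused afterwards, are separate design questions that this sketch does not address.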

Paper Structure

This paper contains 12 sections, 1 equation, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: The overall architecture of our proposed parallel hybrid model, FlowHN.
  • Figure 2: Illustrative example of our four token-splitting strategies in action, showing how input tokens are partitioned and processed under each method.
  • Figure 3: Effect of model scale on TPS and MFU. Bars (left axis) show TPS, while lines (right axis) trace MFU for each model at 135M, 350M, and 1B parameters.
  • Figure 4: Accuracy as a function of model scale (135M, 350M, and 1B parameters) comparing our approach against state-of-the-art hybrid models.
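
The two efficiency metrics plotted in Figure 3 can be computed with the standard definitions sketched below: TPS is tokens processed divided by wall-clock time, and MFU is the achieved model FLOPs per second divided by the hardware peak. This is an illustrative sketch, not the paper's measurement code. The per-token FLOP estimate of 6 × parameters is a common approximation for dense transformer training and is only an assumption for a hybrid model; the token count, timing, and peak throughput in the usage lines are hypothetical.

```python
def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Throughput: tokens processed per second of wall-clock time."""
    return num_tokens / elapsed_s


def model_flops_utilization(
    tps: float, flops_per_token: float, peak_flops_per_s: float
) -> float:
    """MFU: fraction of the hardware's peak FLOP rate spent on model FLOPs."""
    return tps * flops_per_token / peak_flops_per_s


# Hypothetical run: 1M tokens in 10 s on an accelerator with a 312 TFLOP/s peak.
tps = tokens_per_second(1_000_000, 10.0)
# Assumed approximation: about 6 FLOPs per parameter per token for training.
flops_per_token = 6 * 135e6
mfu = model_flops_utilization(tps, flops_per_token, 312e12)
print(f"TPS = {tps:.0f}, MFU = {mfu:.1%}")
```

Because MFU counts only the FLOPs the model requires, a branch that sits idle while waiting for the other lowers MFU even when the hardware itself is fully capable.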