Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems

Guowei Liu, Hongming Li, Yaning Guo, Yongxi Lyu, Mo Zhou, Yi Liu, Zhaogeng Li, Yanpeng Wang

TL;DR

This paper systematically analyzes Attention-FFN Disaggregation (AFD) by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). It shows that AFD's discrete node-level scaling incurs higher imbalance penalties than the continuous batch adjustment of Expert Parallelism (EP).

Abstract

Deploying large-scale MoE models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing FFN instance count fails to improve HFU as computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth and models with coarse-grained experts and lower sparsity are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution.
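
To make the idea of a communication-level roofline concrete, the minimal sketch below treats the interconnect as a third ceiling next to compute and memory bandwidth. It is an illustration under stated assumptions, not the paper's formulation: the function names, the notion of "FLOPs per byte sent over the network" as a communication-side arithmetic intensity, and every number are hypothetical.

```python
def attainable_flops(peak_flops, mem_bw, net_bw,
                     flops_per_mem_byte, flops_per_net_byte):
    """Three-ceiling roofline: the lowest ceiling bounds attainable FLOP/s.

    All parameter names are illustrative assumptions, not the paper's symbols.
    """
    compute_ceiling = peak_flops                      # FLOP/s
    memory_ceiling = mem_bw * flops_per_mem_byte      # (B/s) * (FLOP/B)
    network_ceiling = net_bw * flops_per_net_byte     # (B/s) * (FLOP/B)
    return min(compute_ceiling, memory_ceiling, network_ceiling)


def hfu_upper_bound(peak_flops, mem_bw, net_bw,
                    flops_per_mem_byte, flops_per_net_byte):
    """Upper bound on utilization: attainable FLOP/s over peak FLOP/s."""
    return attainable_flops(peak_flops, mem_bw, net_bw,
                            flops_per_mem_byte, flops_per_net_byte) / peak_flops


# Hypothetical accelerator and workload numbers, chosen only for illustration.
PEAK = 1000e12    # peak FLOP/s
MEM_BW = 3e12     # memory bandwidth, bytes/s
NET_BW = 50e9     # scale-out interconnect bandwidth, bytes/s

hfu_base = hfu_upper_bound(PEAK, MEM_BW, NET_BW,
                           flops_per_mem_byte=500, flops_per_net_byte=8000)

# Doubling the memory-side intensity while the network ceiling still binds.
hfu_more_mem_intensity = hfu_upper_bound(PEAK, MEM_BW, NET_BW,
                                         flops_per_mem_byte=1000,
                                         flops_per_net_byte=8000)
```

With these made-up numbers the network ceiling binds in both calls, so raising the memory-side intensity leaves the bound unchanged. In this toy model, only more interconnect bandwidth or more FLOPs per transferred byte would raise it, which is in the spirit of the communication-bound regime the abstract describes.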

Paper Structure (23 sections, 12 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Illustration of AFD architecture (left) and micro-batch overlap strategies (right). The AFD architecture physically separates attention and FFN computations, while the overlap strategy determines how micro-batches are pipelined to hide communication latency.
  • Figure 2: Normalized arithmetic intensity as a function of FFN node count ($N_F$) for DeepSeek-V3 on the H800 platform. The blue curve shows the theoretical upper bound, while the red curve indicates actual values accounting for expert count discretization. The four distinct regions correspond to different bandwidth bottleneck regimes.
  • Figure 3: Grouped GEMM unit tests and theoretical roofline vs. M (average tokens per expert) on H20 and H200 platforms. The left column assumes balanced token distribution across experts, while the right column reflects realistic imbalanced scenarios where some experts receive disproportionately more tokens.
  • Figure 4: Theoretical upper-bound HFU of different models (ERNIE, Kimi K2, Qwen3, GLM-4.5, DeepSeek-V3, Step-3) across hardware platforms under AFD deployment. The communication-bound HFU ceiling is highlighted in bold red. For all models, we assume MTP readiness with $L_{\text{accept}} = 1.7$, regardless of native support. All layers, dense or sparse, are assumed to have identical execution latency. We also assume a fixed deployment unit of 8 GPUs per node regardless of the physical Superpod scale. Detailed model and hardware configurations are provided in the appendix (Model and Hardware Configurations).
  • Figure 5: Comparison of DP and EP imbalance handling strategies between large-scale EP and AFD deployments. The key distinction lies in AFD's discrete scaling constraints versus large-scale EP's continuous adjustment capability.
  • ...and 1 more figures
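
One simple way to picture the discrete-versus-continuous distinction in the Figure 5 caption is a toy capacity-matching calculation. The sketch below is an assumption-laden illustration, not the paper's imbalance model: it supposes the FFN side must be provisioned in whole nodes, and that any capacity beyond what the attention side can feed sits idle. The function name and the rates are hypothetical.

```python
import math


def discrete_ffn_provisioning(attn_rate, ffn_rate_per_node):
    """Toy model: FFN capacity comes in whole nodes (hypothetical setup).

    attn_rate         -- tokens/ms produced by the attention side
    ffn_rate_per_node -- tokens/ms one FFN node can consume
    Returns (nodes, idle_fraction) for the smallest node count that keeps up.
    """
    ideal_nodes = attn_rate / ffn_rate_per_node   # fractional, unattainable
    nodes = math.ceil(ideal_nodes)                # round up to whole nodes
    idle_fraction = 1.0 - ideal_nodes / nodes     # provisioned but unused share
    return nodes, idle_fraction


# Hypothetical rates: the ideal match would be 2.5 FFN nodes.
nodes, idle = discrete_ffn_provisioning(attn_rate=10.0, ffn_rate_per_node=4.0)

# A continuously adjustable knob (for example, batch size) could in principle
# match the two rates exactly, leaving no idle share in this toy model.
continuous_idle = 0.0
```

Here the ideal of 2.5 nodes must be rounded up to 3, so one sixth of the provisioned FFN capacity is unused in this toy setting, whereas a continuously tunable parameter avoids the rounding step altogether.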