Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems

Guowei Liu; Hongming Li; Yaning Guo; Yongxi Lyu; Mo Zhou; Yi Liu; Zhaogeng Li; Yanpeng Wang

TL;DR

This paper systematically analyzes Attention-FFN Disaggregation (AFD) by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). It shows that AFD's discrete node-level scaling incurs higher imbalance penalties than the continuous batch adjustment of Expert Parallelism (EP).

Abstract

Deploying large-scale MoE models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing FFN instance count fails to improve HFU as computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth and models with coarse-grained experts and lower sparsity are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution.
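As a rough illustration of the roofline reasoning the abstract refers to, the sketch below applies the classic roofline bound (attainable throughput is the minimum of peak compute and bandwidth times arithmetic intensity) with interconnect bandwidth in place of memory bandwidth. This is only one reading of "communication level"; the function names and all numbers are hypothetical and are not taken from the paper.

```python
def attainable_flops(peak_flops: float, bandwidth_bytes_per_s: float,
                     intensity_flops_per_byte: float) -> float:
    """Classic roofline bound: capped by compute or by data movement."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)


def hfu_ceiling(peak_flops: float, bandwidth_bytes_per_s: float,
                intensity_flops_per_byte: float) -> float:
    """Upper bound on hardware FLOPS utilization implied by the roofline."""
    return attainable_flops(peak_flops, bandwidth_bytes_per_s,
                            intensity_flops_per_byte) / peak_flops


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity above which the workload becomes compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical device: 1000 TFLOPS peak, 50 GB/s scale-out bandwidth.
    peak = 1000e12
    net_bw = 50e9
    print("ridge point (FLOPs/byte):", ridge_point(peak, net_bw))
    for intensity in (5_000, 10_000, 20_000, 40_000):
        print(intensity, hfu_ceiling(peak, net_bw, intensity))
```

Under these assumed numbers, any workload whose FLOPs per transferred byte fall below the ridge point stays bandwidth-bound no matter how much compute is added, which is the qualitative shape of the dead zone described in the abstract.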

Paper Structure (23 sections, 12 equations, 6 figures, 5 tables)

Introduction
Preliminary
Notations
AFD and Budget under 3BO
Performance Metrics
Trends in Modern MoE Architectures
System Analysis
Arithmetic Intensity in MoE Inference
AFD Dead Zone in Pursuit of HFU
Imbalance Penalty
DP Imbalance
EP Imbalance
Implementation Challenges
Implications for AFD-Favorable Configurations
Model-Level Considerations
...and 8 more sections

Figures (6)

Figure 1: Illustration of AFD architecture (left) and micro-batch overlap strategies (right). The AFD architecture physically separates attention and FFN computations, while the overlap strategy determines how micro-batches are pipelined to hide communication latency.
Figure 2: Normalized arithmetic intensity as a function of FFN node count ($N_F$) for DeepSeek-V3 on the H800 platform. The blue curve shows the theoretical upper bound, while the red curve indicates actual values accounting for expert count discretization. The four distinct regions correspond to different bandwidth bottleneck regimes.
Figure 3: Grouped GEMM unit tests and theoretical roofline vs. M (average tokens per expert) on H20 and H200 platforms. The left column assumes balanced token distribution across experts, while the right column reflects realistic imbalanced scenarios where some experts receive disproportionately more tokens.
Figure 4: Theoretical upper-bound HFU of different models (ERNIE, Kimi K2, Qwen3, GLM-4.5, DeepSeek-V3, Step-3) across hardware platforms under AFD deployment. The communication-bound HFU ceiling is highlighted in bold red. For all models, we assume MTP readiness with $L_{\text{accept}} = 1.7$, regardless of native support. All layers, dense or sparse, are assumed to have identical execution latency. We also assume a fixed deployment unit of 8 GPUs per node regardless of the physical Superpod scale. Detailed model and hardware configurations are provided in the appendix.
Figure 5: Comparison of DP and EP imbalance handling strategies between large-scale EP and AFD deployments. The key distinction lies in AFD's discrete scaling constraints versus large-scale EP's continuous adjustment capability.
...and 1 more figures
