UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA

Jiale Dong, Wenqi Lou, Zhendong Zheng, Yunji Qin, Lei Gong, Chao Wang, Xuehai Zhou

TL;DR

UbiMoE is a dual-kernel FPGA accelerator for Mixture-of-Experts Vision Transformers (MoE-ViT) that pairs a latency-optimized streaming attention kernel with a resource-efficient, reusable linear kernel. A two-stage heuristic search (a genetic algorithm, GA, followed by binary search) tunes the hardware parameters to the resources of a given FPGA, balancing latency against resource usage. On Xilinx ZCU102 and Alveo U280, UbiMoE reports 1.34x and 3.35x higher throughput and 1.75x and 1.54x better energy efficiency than prior FPGA designs, and it also reports gains over GPUs; the authors suggest the approach extends to standard transformers. The work targets three practical obstacles to MoE-ViT acceleration: memory access patterns, load balancing under dynamic expert indices, and automated design-space exploration across heterogeneous FPGA platforms.
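To make the idea of a two-stage search concrete, the following is a minimal sketch of how a genetic algorithm and a binary search might be chained. It is not the authors' algorithm: the cost models, the DSP budget, the parameter names (`pes`, `tile`, `linear_p`), and the split of work between the two stages are all hypothetical, chosen only to show the pattern. Stage 1 uses a small GA to pick attention-kernel parameters under a resource budget; stage 2 uses binary search, which applies because the toy linear-kernel latency decreases monotonically with parallelism, to find the smallest parallelism that keeps the linear kernel from being the bottleneck.

```python
import random

# All constants and cost models below are hypothetical toy values.
DSP_BUDGET = 1000          # assumed DSP budget for the whole design
WORK_ATTN = 1_000_000      # assumed attention work (MACs)
WORK_LINEAR = 4_000_000    # assumed linear-layer work (MACs)
PES_CHOICES = [1, 2, 4, 8, 16, 32]
TILE_CHOICES = [4, 8, 16, 32, 64]

def attn_dsp(cfg):
    pes, tile = cfg
    return pes * tile                      # toy resource model

def attn_latency(cfg):
    pes, tile = cfg
    return WORK_ATTN / (pes * tile) + 50 * pes   # toy: more PEs add overhead

def fitness(cfg, budget):
    # Configurations that exceed the budget get infinite cost.
    return attn_latency(cfg) if attn_dsp(cfg) <= budget else float("inf")

def ga_search(budget, rng, pop_size=20, generations=30):
    """Stage 1: tiny genetic algorithm over (pes, tile)."""
    def rand_cfg():
        return (rng.choice(PES_CHOICES), rng.choice(TILE_CHOICES))
    # Seed with the smallest configuration so a feasible one always exists.
    pop = [(PES_CHOICES[0], TILE_CHOICES[0])]
    pop += [rand_cfg() for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, budget))
        parents = pop[: pop_size // 2]     # selection: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])           # crossover: one gene from each parent
            if rng.random() < 0.2:         # mutation: resample the child
                child = rand_cfg()
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda c: fitness(c, budget))

def linear_latency(p):
    return WORK_LINEAR / p                 # toy: monotonically decreasing in p

def min_parallelism(target_latency, max_p):
    """Stage 2: smallest p in [1, max_p] meeting the target, else max_p."""
    lo, hi = 1, max_p
    while lo < hi:
        mid = (lo + hi) // 2
        if linear_latency(mid) <= target_latency:
            hi = mid
        else:
            lo = mid + 1
    return lo

rng = random.Random(0)                     # fixed seed for repeatability
attn_budget = DSP_BUDGET // 2
best_attn = ga_search(attn_budget, rng)
remaining = DSP_BUDGET - attn_dsp(best_attn)
linear_p = min_parallelism(attn_latency(best_attn), remaining)
print("attention (pes, tile):", best_attn, "linear parallelism:", linear_p)
```

The GA handles the part of the space where the toy cost is not monotone (adding PEs first helps, then hurts), while binary search is reserved for the parameter whose effect is monotone, where it needs only a logarithmic number of evaluations.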

Abstract

Compared to traditional Vision Transformers (ViT), Mixture-of-Experts Vision Transformers (MoE-ViT) scale model size without a proportional increase in computational complexity, which has made them a new research focus. Given their high performance and reconfigurability, FPGA-based accelerators for MoE-ViT have emerged, delivering substantial gains over general-purpose processors. However, existing accelerators often fall short of fully exploring the design space, leading to suboptimal trade-offs between resource utilization and performance. To overcome this problem, we introduce UbiMoE, a novel end-to-end FPGA accelerator tailored for MoE-ViT. Leveraging the unique computational and memory access patterns of MoE-ViTs, we develop a latency-optimized streaming attention kernel and a resource-efficient reusable linear kernel, effectively balancing performance and resource consumption. To further enhance design efficiency, we propose a two-stage heuristic search algorithm that optimally tunes hardware parameters for various FPGA resource constraints. Compared to state-of-the-art (SOTA) FPGA designs, UbiMoE achieves 1.34x and 3.35x throughput improvements for MoE-ViT on the Xilinx ZCU102 and Alveo U280 platforms, respectively, while improving energy efficiency by 1.75x and 1.54x. Our implementation is available at https://github.com/DJ000011/UbiMoE.
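The claim that MoE layers grow model size without a proportional growth in computation can be illustrated with a generic top-k routing sketch. This is an assumption-laden illustration of sparse expert selection in general, not the gating used by any specific MoE-ViT or by UbiMoE: the function names, the softmax gate, the weight renormalization, and the toy experts are all hypothetical. Each token runs through only `k` of the available experts, so per-token compute depends on `k` rather than on the total number of experts.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k):
    """Return (expert_index, renormalized_weight) for the k best experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, gate_logits, experts, k):
    """Weighted sum of the outputs of only the k selected experts."""
    selected = route_top_k(gate_logits, k)
    out = [0.0] * len(token)
    for idx, weight in selected:
        y = experts[idx](token)                  # unselected experts never run
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in selected]

# Hypothetical toy experts: each scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
gate_logits = [0.1, 2.0, -1.0, 2.0]              # made-up gate scores

output, used = moe_layer(token, gate_logits, experts, k=2)
print("experts used:", used, "output:", output)
```

Because the selected indices depend on the input, the set of expert weights to fetch is only known at run time, which is the kind of dynamic expert index that the TL;DR names as a load-balancing concern.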

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: The structure of the MoE Vision Transformer.
  • Figure 2: The overall architecture of UbiMoE (except the QKV Generate and Norm kernels for simplicity).
  • Figure 3: Processing flow with double buffering.
  • Figure 4: Running process before and after optimization. Blue q blocks are fixed to specific PEs, while the color of k blocks changes during kernel running.
  • Figure 5: Implementation results of $\text{M}^3$ViT on both platforms.