GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors

Chengming Zhang; Xinheng Ding; Baixi Sun; Xiaodong Yu; Weijian Zheng; Zhen Xie; Dingwen Tao


TL;DR

Transformers on Gaudi are slowed by inefficient Softmax computation and by unbalanced use of the processor's heterogeneous engines. GFormer addresses both problems by combining windowed sparse attention on the TPC with a linear attention path mapped to the MME, supported by an optimized outer-product kernel on the TPC. A partition algorithm splits the attention heads between the two paths using a parameter tau so that MME and TPC runtimes are roughly equal, while preserving accuracy. The key contributions are the windowed Sparse Attention kernel, the efficient Outer Product kernel for causal linear attention, and the Optimal Partition Algorithm. Experiments show up to 2× speedups on GPT and ViT benchmarks, with accuracy competitive with the baselines. The work demonstrates hardware-aware acceleration of large-scale Transformer inference on Gaudi for LLM and vision tasks.
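The head-split idea can be illustrated with a small sketch. It is not the paper's Optimal Partition Algorithm. It assumes that the two engines run concurrently, that each head has a fixed cost on its engine, and that tau is the fraction of heads sent to the linear attention (MME) path. The function name `split_heads` and both per-head timing parameters are hypothetical.

```python
def split_heads(num_heads, t_mme_per_head, t_tpc_per_head):
    """Pick how many heads go to the MME path so both engines finish together.

    Assumptions (not from the paper): engines run in parallel, and each
    head costs a fixed time on its engine. Returns (tau, heads_on_mme,
    makespan), where tau = heads_on_mme / num_heads.
    """
    best = None
    for k in range(num_heads + 1):
        mme_time = k * t_mme_per_head                # linear attention heads
        tpc_time = (num_heads - k) * t_tpc_per_head  # sparse attention heads
        makespan = max(mme_time, tpc_time)           # slower engine dominates
        if best is None or makespan < best[2]:
            best = (k / num_heads, k, makespan)
    return best


tau, heads_on_mme, makespan = split_heads(12, t_mme_per_head=1.0, t_tpc_per_head=2.0)
print(tau, heads_on_mme, makespan)
```

With these made-up timings, the faster MME path receives more heads, which is the balancing behavior the summary describes.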

Abstract

Heterogeneous hardware such as the Gaudi processor has been developed to accelerate computation, especially the matrix operations in Transformer-based large language models (LLMs) used for generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily because non-matrix computational kernels such as Softmax are inadequately optimized and heterogeneous resources are poorly utilized, particularly when processing long sequences. To address these issues, we propose an integrated approach, called GFormer, that merges sparse and linear attention mechanisms. GFormer aims to maximize the computational capabilities of the Gaudi processor's Matrix Multiplication Engine (MME) and Tensor Processing Cores (TPC) without compromising model quality. It includes a windowed self-attention kernel and an efficient outer product kernel for causal linear attention, both designed to optimize LLM inference on Gaudi processors. Evaluation shows that GFormer significantly improves efficiency and model performance across various tasks on the Gaudi processor and outperforms state-of-the-art GPUs.
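To see why causal linear attention relies on outer products, the following sketch uses the standard running-sum formulation of linear attention. It is not GFormer's TPC kernel. The feature map elu(x) + 1 is an assumption taken from common linear attention practice, and the names `feature_map` and `causal_linear_attention` are illustrative only.

```python
import math


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention
    # (an assumption here; the abstract does not name one).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(Q, K, V):
    """Causal linear attention over lists of row vectors.

    Keeps a running sum S of outer products phi(k_j) v_j^T and a running
    sum z of phi(k_j), so each position costs O(d_k * d_v) and no
    N x N Softmax matrix is ever formed.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = feature_map(q), feature_map(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]  # outer-product update
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out


Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(causal_linear_attention(Q, K, V))
```

The inner double loop is the outer-product accumulation. It is a sequence of small rank-1 updates rather than one large matrix multiplication, which is presumably why the abstract treats it as a separate kernel to optimize.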


Paper Structure (22 sections, 4 equations, 11 figures, 3 tables, 2 algorithms)


Design Methodology
Overview of This Work
TPC Best Fitted Sparse Attention
Efficient Outer Product on TPC
Optimal Partition Algorithm
Performance Evaluation
Experimental Setup
Platforms
Implementation details
Models
Evaluation on Speedup of Sparse Attention Kernel
Evaluation on Performance of Outer Product kernel
Evaluation on Partition
Evaluation on Mixed Attention
Evaluation on Speedup and Accuracy
...and 7 more sections

Figures (11)

Figure 1: Profiling result of original causal linear attention.
Figure 2: Profiling result of optimized causal linear attention.
Figure 3: Speedup of windowed attention kernel.
Figure 4: Performance of Outer Product Kernel.
Figure 5: Performance of different partitions. "Causal" and "self" are short for causal attention and self-attention.
...and 6 more figures
