LightSeq: A High Performance Inference Library for Transformers

Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li

TL;DR

LightSeq is a high-performance inference library for the Transformer family that combines coarse-grained operation fusion, Hierarchical Auto Regressive Search, and dynamic GPU memory reuse. Together, these techniques reduce the number of kernel launches, prune redundant computation in auto-regressive decoding, and minimize memory allocations for variable-length inputs. Experiments on machine translation show up to 14x speedup over TensorFlow and 1.4x over FasterTransformer, with high GPU utilization and favorable generation-time performance. These gains matter for online services that need low-latency sequence processing and generation. Integer-arithmetic inference and sparse GEMM are named as future enhancements.

Abstract

Transformer, BERT and their variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose LightSeq, a highly efficient inference library for models in the Transformer family. LightSeq includes a series of GPU optimization techniques to streamline the computation of neural layers and to reduce memory footprint. LightSeq can easily import models trained using PyTorch and TensorFlow. Experimental results on machine translation benchmarks show that LightSeq achieves up to 14x speedup compared with TensorFlow and 1.4x compared with FasterTransformer, a concurrent CUDA implementation. The code is available at https://github.com/bytedance/lightseq.

Paper Structure

This paper contains 11 sections, 6 figures, and 1 table.

Figures (6)

  • Figure 1: The process of sequence-to-sequence generation using a Transformer model with beam search.
  • Figure 2: The structure of optimized Transformer encoder layers in LightSeq.
  • Figure 3: An illustration of the proposed hierarchical strategy. In this case, beam size is 2 and vocabulary size is 8. Each row represents logits in a beam.
  • Figure 4: Proportion of computation occupation. GEMM is the main indicator; a larger proportion indicates higher computational efficiency.
  • Figure 5: Speedup on Transformer with beam search compared with the FasterTransformer, TurboTransformers and PyTorch implementations. The baseline is the TensorFlow implementation.
  • ...and 1 more figures
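
The caption of Figure 3 describes a hierarchical strategy over per-beam logits, with beam size 2 and vocabulary size 8. As a rough illustration only, the Python sketch below shows one possible two-stage top-k search over such a matrix: a coarse filter followed by an exact re-rank. The strided grouping, the function name `hierarchical_top_k`, and the sample logits are assumptions made for this example. They are not taken from the paper or from the LightSeq code base, which implements its search in CUDA.

```python
import heapq


def hierarchical_top_k(beam_logits, k):
    """Return the k best (score, beam_id, token_id) triples, best first.

    Hypothetical two-stage search:
      1. Per beam, split the row into k strided groups and use the smallest
         group maximum as a coarse threshold. At least k entries reach that
         threshold, so the row's true top-k always survive the filter.
      2. Re-rank only the surviving candidates exactly.
    """
    candidates = []
    for beam_id, row in enumerate(beam_logits):
        if len(row) < k:
            raise ValueError("each beam needs at least k logits")
        # Stage 1: coarse threshold from the k group maxima.
        threshold = min(max(row[i::k]) for i in range(k))
        candidates.extend(
            (score, beam_id, token_id)
            for token_id, score in enumerate(row)
            if score >= threshold
        )
    # Stage 2: exact top-k over the (much smaller) candidate set.
    return heapq.nlargest(k, candidates)


# Illustrative values only: 2 beams, vocabulary of 8 (the sizes named in Figure 3).
beam_logits = [
    [0.1, 2.0, 0.3, 1.5, 0.2, 0.9, 0.4, 0.7],
    [1.1, 0.2, 3.0, 0.5, 0.6, 0.8, 2.5, 0.0],
]
print(hierarchical_top_k(beam_logits, 2))  # [(3.0, 1, 2), (2.5, 1, 6)]
```

In this toy case the filter keeps 9 of the 16 logits before the exact re-rank. With a realistic vocabulary the candidate set would typically be a far smaller fraction of the row, which is where a strategy of this kind saves work. The sketch also assumes the scores are already comparable across beams.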