Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference

Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff

TL;DR

Kraken is a Transformer variant with a fixed degree of intra-layer parallelism that lets inter-device collectives overlap with computation, reducing inference latency on multi-device systems while preserving GPT-2-style language modeling quality. Its configurations are derived to maintain parameter budgets, and each layer uses a single end-of-layer AllReduce to minimize inter-device dependencies. Evaluations on OpenWebText and SuperGLUE show that Kraken reaches competitive perplexity and task performance. Measured with TensorRT-LLM engines on an 8× A100 NVSwitch system, Kraken delivers a geomean Time To First Token (TTFT) speedup of 35.6%, a practical benefit for latency-sensitive applications.

Abstract

Large Transformer networks are increasingly used in settings where low inference latency can improve the end-user experience and enable new applications. However, autoregressive inference is resource intensive and requires parallelism for efficiency. Parallelism introduces collective communication that is both expensive and represents a phase when hardware resources are underutilized. Towards mitigating this, Kraken is an evolution of the standard Transformer architecture that is designed to complement existing tensor parallelism schemes for efficient inference on multi-device systems. By introducing a fixed degree of intra-layer model parallelism, the architecture allows collective operations to be overlapped with compute, decreasing latency and increasing hardware utilization. When trained on OpenWebText, Kraken models reach a similar perplexity as standard Transformers while also preserving their language modeling capabilities when evaluated on the SuperGLUE benchmark. Importantly, when tested on multi-GPU systems using TensorRT-LLM engines, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes, context lengths, and degrees of tensor parallelism.
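
The overlap described in the abstract can be illustrated with a toy timing model. This is a minimal sketch under stated assumptions, not the authors' implementation or measurement method. It assumes the standard tensor-parallel layer blocks on two AllReduce operations (one after attention, one after the FeedForward Network), that a Kraken layer issues a single end-of-layer AllReduce whose result is needed only before the next layer's FeedForward Network, and that per-layer compute times are equal in both designs. The function names and all timing values are hypothetical.

```python
def standard_time(n_layers, t_attn, t_ffn, t_allreduce):
    """Toy latency of a tensor-parallel standard Transformer.

    Assumption: each layer blocks on two AllReduce ops, one after
    attention and one after the FeedForward Network.
    """
    return n_layers * (t_attn + t_allreduce + t_ffn + t_allreduce)


def kraken_time(n_layers, t_attn, t_ffn, t_allreduce):
    """Toy latency of a Kraken-style model.

    Assumption: one AllReduce per layer, issued at the end of the layer,
    runs concurrently with the next layer's attention and must finish
    before that layer's FeedForward Network starts.
    """
    compute_done = 0.0  # time at which the device finishes its latest compute
    comm_done = 0.0     # time at which the previous layer's AllReduce finishes
    for _ in range(n_layers):
        # Attention starts right after the previous FFN, without waiting
        # for the previous layer's AllReduce.
        attn_done = compute_done + t_attn
        # The FFN waits for both its own attention and the pending AllReduce.
        ffn_start = max(attn_done, comm_done)
        compute_done = ffn_start + t_ffn
        # The end-of-layer AllReduce begins once the FFN output is ready.
        comm_done = compute_done + t_allreduce
    # The last AllReduce has no following attention to hide behind.
    return comm_done


if __name__ == "__main__":
    # Hypothetical unit-less step lengths, chosen only for illustration.
    baseline = standard_time(n_layers=2, t_attn=4, t_ffn=3, t_allreduce=2)
    overlapped = kraken_time(n_layers=2, t_attn=4, t_ffn=3, t_allreduce=2)
    print(f"standard: {baseline}, kraken-style: {overlapped}")
```

In this model the communication cost is hidden whenever the AllReduce takes no longer than the next layer's attention. When it takes longer, only the excess appears on the critical path.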

Paper Structure

This paper contains 21 sections, 9 figures, 7 tables, and 1 algorithm.

Figures (9)

  • Figure 1: One layer of a standard Transformer consisting of Multi-Head Attention (also shown) followed by a FeedForward Network. Residual connections have been omitted.
  • Figure 2: Increasing the degree of tensor parallelism decreases the Time To First Token. Even when weights and KV cache fit on device memory, parallelism can be worthwhile. These numbers are for a 6.7B parameter GPT-3 like model and were collected using TensorRT-LLM engines on our evaluation platform: an HGX A100 40GB system.
  • Figure 3: Parallelizing two standard Transformer layers compared to executing two layers of a Kraken Transformer with 2-way parallelism. Kraken Transformers have fewer AllReduce ops and these can be run concurrently with the Multi-Head Attention in the next layer. Step lengths are illustrative and not indicative of how much wall-clock time a particular operation might actually require.
  • Figure 4: Speedup in Time To First Token over standard Transformers on a system that uses NVSwitch and with 4-way parallelism. x-axis labels denote the size of the model followed by the context length. Bar labels are in percentage.
  • Figure 5: Speedup in Time To First Token over standard Transformers on a system that uses NVSwitch and with 8-way parallelism. x-axis labels denote the size of the model followed by the context length. Bar labels are in percentage.
  • ...and 4 more figures