Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin

TL;DR

We present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths.

Abstract

Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5% for 32B Transformers, while matching previous context parallelism techniques in terms of training speed. UPipe can support a context length of 5M tokens when training Llama3-8B on a single 8$\times$H100 node, improving upon prior methods by over 25%.
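The idea of headwise chunking can be illustrated with a minimal single-process sketch. It relies only on the fact that attention heads are computed independently, so processing them a few at a time gives the same output as processing them all at once while keeping fewer intermediate buffers alive. The function names, the `heads_per_stage` parameter, and the toy sizes are assumptions made for illustration; this is not the authors' implementation and it omits all inter-device communication.

```python
import math
import random


def attend_one_head(q, k, v):
    """Softmax attention for one head; q, k, v are lists of d-dimensional rows."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / z
                    for c in range(len(v[0]))])
    return out


def headwise_chunked_attention(q_heads, k_heads, v_heads, heads_per_stage):
    """Run attention over heads in stages (hypothetical helper).

    Returns (per-head outputs, peak number of heads buffered at once).
    """
    num_heads = len(q_heads)
    outputs = [None] * num_heads
    peak_live = 0
    for start in range(0, num_heads, heads_per_stage):
        stage = range(start, min(start + heads_per_stage, num_heads))
        # Stand-in for the per-stage intermediate buffers: only this stage's
        # heads are held, and the next stage overwrites them.
        buffers = [(q_heads[h], k_heads[h], v_heads[h]) for h in stage]
        peak_live = max(peak_live, len(buffers))
        for h, (q, k, v) in zip(stage, buffers):
            outputs[h] = attend_one_head(q, k, v)
    return outputs, peak_live


# Toy sizes chosen for illustration only (not the paper's configuration).
random.seed(0)
H, S, D = 8, 4, 2


def rand_heads():
    return [[[random.random() for _ in range(D)] for _ in range(S)]
            for _ in range(H)]


q, k, v = rand_heads(), rand_heads(), rand_heads()
all_at_once, peak_all = headwise_chunked_attention(q, k, v, heads_per_stage=H)
one_by_one, peak_one = headwise_chunked_attention(q, k, v, heads_per_stage=1)
assert all_at_once == one_by_one
print(peak_all, peak_one, 1 - peak_one / peak_all)  # prints: 8 1 0.875
```

In this toy setting, buffering one head instead of eight cuts the buffered-head count by 87.5%. The 8-to-1 ratio is an illustrative choice and is not claimed to be the configuration behind the figure reported in the abstract.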

Paper Structure (25 sections, 6 figures, 6 tables)

This paper contains 25 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison of context parallelism approaches on long-sequence training for Llama3-8B using 8$\times$H100s. UPipe provides maximum efficiency, resulting in longer maximum context length (5M tokens) while retaining throughput.
  • Figure 2: Memory usage breakdown when training Llama3-8B with a sequence length of 3M tokens across 8 H100 GPUs. AC stands for Activation Checkpointing, AO denotes AC with offloading, OOM stands for Out of Memory.
  • Figure 3: Illustration of (a) DeepSpeed-Ulysses and (b) UPipe designs. UPipe processes attention in a headwise untied manner, so that in each stage, attention is performed only on a subset of heads. This allows memory reuse across different stages, significantly reducing the peak memory usage due to attention activations. HBM (High-Bandwidth Memory) usage illustrates memory utilization due to the intermediate buffers and omits other components for brevity.
  • Figure 4: Illustration of UPipe's GQA scheduling algorithm. We communicate as many unique key/value heads as possible along with the corresponding queries in stage-0. In the subsequent stages, we only communicate the next queries of the corresponding groups, reusing the key/value tensors from stage-0 until stage-$G$, where $G$ is the group size.
  • Figure 5: Llama3-8B: Comparison of peak GPU memory usage and throughput (normalized w.r.t. USP-Hybrid) for UPipe and USP-Hybrid at different sequence lengths on 16$\times$H100s. Our method significantly outperforms USP-Hybrid in terms of memory efficiency and allows a maximum context size of 8M tokens, improving upon USP-Hybrid (6M tokens) by 33%, while maintaining throughput.
  • ...and 1 more figure
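The GQA scheduling idea in the Figure 4 caption can be sketched as a small schedule generator. This is one reading of the caption, not the paper's algorithm: the function name, the `kv_heads_per_stage` parameter, and the assumption that query head `h` belongs to key/value group `h // G` (a contiguous grouping) are all hypothetical choices made for illustration.

```python
def gqa_stage_schedule(num_q_heads, num_kv_heads, kv_heads_per_stage):
    """Return a list of (kv_heads_to_send, q_heads_to_send), one per stage.

    Hypothetical helper. Assumes query head h uses key/value head h // G,
    where G = num_q_heads // num_kv_heads is the group size.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    schedule = []
    for block_start in range(0, num_kv_heads, kv_heads_per_stage):
        block_end = min(block_start + kv_heads_per_stage, num_kv_heads)
        kv_block = list(range(block_start, block_end))
        for step in range(group_size):
            # Key/value heads are sent only in the first stage of a block
            # and reused for the remaining stages of that block.
            kv_to_send = list(kv_block) if step == 0 else []
            # Each stage sends the next query head of every group in the block.
            q_to_send = [kv * group_size + step for kv in kv_block]
            schedule.append((kv_to_send, q_to_send))
    return schedule


# Toy example: 8 query heads, 2 key/value heads (G = 4), both groups per stage.
for stage, (kv_sent, q_sent) in enumerate(gqa_stage_schedule(8, 2, 2)):
    print(f"stage {stage}: send KV heads {kv_sent}, query heads {q_sent}")
```

Under these assumptions, every query head is communicated exactly once and every key/value head is communicated only once, in the first stage of its block. That is the reuse the caption describes.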