TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

TL;DR

TokenRing introduces a bidirectional, fine-grained sequence-parallelism scheme that addresses communication bottlenecks in attention for long-context LLMs. It partitions Q, K, and V across GPUs and, over a fully connected mesh, overlaps the transmission of Q blocks with the return of block outputs, which improves computation-communication overlap and throughput. The design adapts to NVLink/NVSwitch and Ascend interconnects. The approach is validated through case studies on Diffusion Transformers, LLM inference with zigzag causality, and multi-node setups, which show substantial reductions in communication time and better load balancing. The framework targets scalable distributed Transformer inference and training for very long sequences, and it is integrated in xDIT. The authors identify hardware-aware optimization for diverse interconnects and multi-node configurations as future work.

Abstract

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention mechanisms. While sequence parallelism (SP) has been introduced as a potential solution, existing methods often suffer from limited scalability or inefficiency, which limits their effectiveness. Ring-Attention demonstrates the potential for scaling sequence processing but faces significant limitations due to its reliance on peer-to-peer (P2P) communication and inefficient utilization of network resources. As the degree of SP increases, the quadratic decrease in computation time per step contrasts sharply with the linear reduction in communication volume, exacerbating communication bottlenecks. To address these challenges, we propose TokenRing, a fine-grained parallel framework that leverages bidirectional P2P communication to effectively overlap computation and data transmission. By partitioning the attention block and concurrently transmitting Query and block outputs (i.e., $block\_out$ and $block\_lse$) within a fully connected mesh topology, TokenRing achieves significant reductions in communication overhead and better load balancing. These innovations improve the scalability and efficiency of distributed Transformer models, particularly for long-context sequences. Experimental results demonstrate that TokenRing enhances throughput and reduces communication latency. Moreover, its design adapts seamlessly to various multi-GPU interconnect solutions, such as Huawei Ascend, ensuring broad compatibility and cost-effectiveness for distributed LLM inference and training. The code is available at: https://github.com/ACA-Lab-SJTU/token-ring.
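
The scaling mismatch mentioned in the abstract can be illustrated with a back-of-the-envelope cost model. The sketch below is not taken from the paper: the function name `per_step_cost`, the FLOP constant, and the 2-byte element size are assumptions chosen for illustration, and it models a Ring-Attention-style step in which one key block and one value block are passed to the neighbouring device.

```python
def per_step_cost(seq_len, head_dim, sp_degree, bytes_per_elem=2):
    """Rough per-step cost of blockwise attention on one device.

    Hypothetical model (assumed, not from the paper):
      - each device holds seq_len / sp_degree tokens,
      - one step multiplies a query block with one key-value block,
      - one step ships one key block and one value block to a neighbour.
    """
    block = seq_len // sp_degree
    # Q @ K^T and P @ V, counting 2 FLOPs per multiply-accumulate.
    flops = 4 * block * block * head_dim
    # One K block plus one V block.
    comm_bytes = 2 * block * head_dim * bytes_per_elem
    return flops, comm_bytes


# Doubling the SP degree quarters the compute per step but only halves
# the data sent per step, so less compute is available to hide each transfer.
flops_2, comm_2 = per_step_cost(seq_len=8192, head_dim=128, sp_degree=2)
flops_4, comm_4 = per_step_cost(seq_len=8192, head_dim=128, sp_degree=4)
print(flops_2 / flops_4, comm_2 / comm_4)
```

Under these assumptions the ratio of compute to communication per step shrinks in proportion to the block size, which is the bottleneck the abstract says TokenRing targets.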

Paper Structure (17 sections, 1 equation, 6 figures, 1 table, 1 algorithm)

Figures (6)

  • Figure 1: Topology of OCP Accelerator Module.
  • Figure 2: Topology of the NVIDIA NVLink Switch System.
  • Figure 3: Overview comparison of Ring-Attention (a) and TokenRing (b). In TokenRing, each GPU stores a single key-value block, while query blocks circulate through the ring for processing. Each GPU starts with its initial query block and then iterates over the query blocks it receives from the other GPUs. Each query block is combined with the local key-value block to compute self-attention using flash attention. Simultaneously, the block outputs ($block\_out$ and $block\_lse$) are transmitted in the reverse direction to the GPUs that own the corresponding queries, where they update the outputs.
  • Figure 4: Implementation of TokenRing in the xDIT framework using four GPUs.
  • Figure 5: TokenRing Implementation Across Two Nodes.
  • ...and 1 more figures
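
The Figure 3 caption says that $block\_out$ and $block\_lse$ are sent back to update the outputs. A minimal sketch of how such partial results can be combined is given below, using the standard log-sum-exp merge for blockwise softmax attention. It is an illustration under assumptions, not the authors' implementation: the helper names `block_attention` and `merge_blocks` are hypothetical, it handles a single query vector in pure Python, and it omits the usual 1/sqrt(d) scaling for brevity.

```python
import math


def block_attention(q, keys, values):
    """Attention of one query vector against one key-value block.

    Returns (block_out, block_lse): the normalised output for this block
    and the log-sum-exp of its attention scores.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    lse = m + math.log(denom)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / denom
           for j in range(dim)]
    return out, lse


def merge_blocks(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if both blocks had been seen at once."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    w_a = math.exp(lse_a - lse)  # share of the softmax mass in block a
    w_b = math.exp(lse_b - lse)  # share of the softmax mass in block b
    out = [w_a * a + w_b * b for a, b in zip(out_a, out_b)]
    return out, lse


# Toy data: one query, four key-value pairs split into two blocks.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full_out, full_lse = block_attention(q, keys, values)
part_a = block_attention(q, keys[:2], values[:2])
part_b = block_attention(q, keys[2:], values[2:])
merged_out, merged_lse = merge_blocks(*part_a, *part_b)
```

Because the merge needs only the output vector and one scalar per query, the result of each block can be returned to the query's owner independently, which is consistent with the reverse-direction transfer the caption describes.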