Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping

Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao

TL;DR

The paper tackles the latency bottleneck of tensor parallelism in large-model inference by introducing Ladder Residual, an architectural modification that decouples communication from computation to enable straightforward overlap. By rerouting residual connections so subsequent blocks receive a slightly stale input, the approach hides communication latency without rewriting low-level kernels. Empirical results show substantial end-to-end speedups (up to ~29% for 70B models on eight devices, with larger gains when interconnects are slower) and successful training-from-scratch and lightweight post-training adaptation on Llama-3.1-8B-Instruct, achieving comparable accuracy with meaningful speedups. The method is compatible with other parallelism strategies (PP, DDP, FSDP) and offers a simple, hardware-agnostic path toward co-designing model architectures with inference systems.

Abstract

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and a 3B Ladder Transformer from scratch and observe comparable performance to a standard dense Transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens. We release our code for training and inference for easier replication of experiments.

Paper Structure (23 sections, 2 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Illustration of a standard Transformer block (left) and a Ladder Residual block (right). The blue edge denotes the residual connection. In Ladder Residual, the residual connection remains the same while each module $h_i$ takes the stale input $r_{i-2}$.
  • Figure 2: Improvement in end-to-end inference throughput achieved by communication-efficient architectures relative to a standard transformer, benchmarked on Llama-3 70B. Standard refers to the regular Llama-3, and Ladder is Llama-3 with our Ladder Residual architecture. Ladder Residual architecture can achieve up to $29\%$ greater throughput than the standard Transformer. With slower communication (P2P disabled or P2P=0), we observe speedups up to $60\%$. All experiments were conducted on a generation task with $1024$ prompt tokens and $512$ completion tokens. Missing data points indicate CUDA OOM.
  • Figure 3: End-to-end inference throughput improvement on Llama-3-405B on a generation task with $1024$ prompt tokens and $512$ completion tokens. Here we use TP size 16 across two nodes each with 8 GPUs, connected with InfiniBand. Due to the high cost of cross-node communication, Ladder Residual architecture is able to achieve more than $30\%$ improvement across various batch sizes with P2P enabled and around $50\%$ with P2P communication disabled.
  • Figure 4: Pareto frontier of completion latency vs aggregate throughput per GPU for different 70B-scale model architectures in a batched inference setting. For each architecture, we sweep over both batch size and TP world size to find the Pareto-optimal configurations. A smaller TP size yields higher throughput, while a larger TP size optimizes latency; both have their use cases. We found that the Ladder architecture achieves Pareto improvements over both the standard Transformer architecture and the parallel Transformer. All experiments measure end-to-end time on a generation task with $1024$ prompt tokens and $512$ completion tokens per sequence.
  • Figure 5: Traces generated by PyTorch Profiler. As shown in the plot, in the standard Transformer the NCCL operations block the computation, whereas in the Ladder Transformer the NCCL operations can be overlapped with the computation.
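
The residual routing described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not the authors' released code: the names `standard_forward` and `ladder_forward` are hypothetical, the modules are arbitrary callables standing in for attention/MLP blocks, and the boundary convention (the first module's stale input is taken to be the embedding itself) is an assumption. No real communication happens here; the comments only mark where a tensor-parallel all-reduce would sit.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Module = Callable[[Vector], Vector]  # stand-in for an attention or MLP block


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def standard_forward(x: Vector, modules: Sequence[Module]) -> Vector:
    """Standard residual stream: r_i = r_{i-1} + h_i(r_{i-1})."""
    r = list(x)
    for h in modules:
        # Under tensor parallelism, h(r) would need an all-reduce that must
        # finish before the next module can start, since it reads the new r.
        r = add(r, h(r))
    return r


def ladder_forward(x: Vector, modules: Sequence[Module]) -> Vector:
    """Ladder-style routing: r_i = r_{i-1} + h_i(r_{i-2}).

    Assumption for this sketch: r_{-1} = r_0 = x, so the first module
    behaves exactly like a standard residual module.
    """
    stale = list(x)    # r_{i-2}, the stale input fed to the module
    current = list(x)  # r_{i-1}, the up-to-date residual stream
    for h in modules:
        # h reads only the stale stream, so it does not wait for the previous
        # module's output; that output's all-reduce could overlap with h.
        out = h(stale)
        stale, current = current, add(current, out)
    return current


if __name__ == "__main__":
    toy_modules = [
        lambda v: [2 * t for t in v],
        lambda v: [t + 1 for t in v],
        lambda v: [3 * t for t in v],
    ]
    print("standard:", standard_forward([1.0], toy_modules))
    print("ladder:  ", ladder_forward([1.0], toy_modules))
```

The two functions generally produce different outputs for the same weights, which is consistent with the abstract's description of training Ladder Transformers from scratch or retraining a converted model rather than swapping the routing in for free.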