
Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models

Jordan Dotzel, Yash Akhauri, Ahmed S. AbouElhamayed, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang

TL;DR

Large language models face strict memory, latency, and power constraints, motivating dynamic sparsity to reduce compute on a per-input basis. The authors profile residual blocks in transformers and introduce Radial Networks, which route tokens through a dynamic subset of layers using a learnable router, potentially coupled with a unified cache to support sparse attention. Key contributions include a residual-ratio proxy for block importance, empirical evidence that residual contributions shrink with model size, and a router-based architecture that decouples layer count from dynamic depth, enabling scalable, lower-cost generation for very large models. This approach offers practical pathways to scale LLMs to trillion-parameter regimes while maintaining or improving throughput and efficiency.

Abstract

Large language models (LLMs) often struggle with strict memory, latency, and power demands. To meet these demands, various forms of dynamic sparsity have been proposed that reduce compute on an input-by-input basis. These methods improve over static methods by exploiting the variance across individual inputs, which has steadily grown with the exponential increase in training data. Yet, the increasing depth within modern models, currently with hundreds of layers, has opened opportunities for dynamic layer sparsity, which skips the computation for entire layers. In this work, we explore the practicality of layer sparsity by profiling residual connections and establish the relationship between model depth and layer sparsity. For example, the residual blocks in the OPT-66B model have a median contribution of 5% to its output. We then take advantage of this dynamic sparsity and propose Radial Networks, which perform token-level routing between layers guided by a trained router module. These networks can be used in a post-training distillation from sequential networks or trained from scratch to co-learn the router and layer weights. They enable scaling to larger model sizes by decoupling the number of layers from the dynamic depth of the network, and their design allows for layer reuse. By varying the compute token by token, they reduce the overall resources needed for generating entire sequences. Overall, this leads to larger capacity networks with significantly lower compute and serving costs for large language models.
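The profiling step described in the abstract can be illustrated with a small sketch. It assumes the contribution of a residual block is measured as the ratio ||f(x)|| / ||x + f(x)||, where x is the block input and f(x) the block output before the residual addition. That formula, the helper names, and the toy activations are illustrative assumptions rather than the authors' code.

```python
import math
from statistics import median


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def residual_ratio(block_input, block_output):
    """Assumed proxy for block importance: ||f(x)|| / ||x + f(x)||.

    block_input  -- x, the activation entering the residual block
    block_output -- f(x), the block's output before the residual add
    """
    combined = [x + f for x, f in zip(block_input, block_output)]
    denom = l2_norm(combined)
    if denom == 0.0:
        return 0.0  # degenerate case: nothing flows out of the block
    return l2_norm(block_output) / denom


# Toy (hypothetical) per-token activations for one block: (x, f(x)) pairs.
tokens = [
    ([3.0, 4.0], [0.3, 0.4]),  # block adds a small update
    ([1.0, 0.0], [0.0, 0.0]),  # block contributes nothing
    ([0.0, 2.0], [0.0, 2.0]),  # block contributes as much as the input
]

ratios = [residual_ratio(x, f) for x, f in tokens]
median_ratio = median(ratios)
print(ratios, median_ratio)
```

Under this assumed definition, a small median ratio for a block means that most tokens pass through it nearly unchanged, which is the property that dynamic layer sparsity exploits.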

Paper Structure (16 sections, 3 equations, 11 figures)

This paper contains 16 sections, 3 equations, 11 figures.

Figures (11)

  • Figure 1: Radial Networks -- Radial Networks generalize sequential networks for higher accuracy and performance. They take advantage of significant dynamic layer sparsity within modern LLMs and invoke only a subset of model layers for each token. They reduce the average network depth, lower model latency, and provide a more scalable neural architecture.
  • Figure 2: Dynamic Layer Sparsity -- As transformers grow larger, each layer contributes less to the output and shows significant variation on a token-by-token basis. Dynamically pruning these layers allows models to grow significantly without corresponding increases in model latency. Early profiling results suggest that individual layers contribute around 1% within modern state-of-the-art language models.
  • Figure 3: Sparsity Granularity -- Bits form the basis for elements (weights or activations), which create blocks (rows, columns, heads), which then form individual layers. This leads to a sparsity spectrum where the smaller units are easier to prune without accuracy loss yet more difficult to accelerate. Layer sparsity has the highest potential for inference speedup and the largest support within current hardware.
  • Figure 4: Residual Blocks -- There are two types of residual blocks within transformers, attention (ATT) and feed-forward network (FFN). These blocks offer natural points to profile layer strength since block inputs and outputs are combined at a single point. To establish an upper bound on the effectiveness of dynamic layer sparsity, oracles that know each layer's contribution in advance are inserted before each block.
  • Figure 5: Dynamic Depth -- The deeper layers in the network contribute more than the earlier layers, except for the very first layers. This relationship benefits from a routed architecture as opposed to early-exit, since early-exit skips the deeper layers. In addition, there is significant variance in the dynamic depth of the model, allowing for token-specific sparsity.
  • ...and 6 more figures
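
The oracle setup in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the blocks are toy scaling functions standing in for ATT and FFN blocks, the contribution measure ||f(x)|| / ||x + f(x)|| and the threshold value are assumed, and all function names are hypothetical. The oracle here evaluates each block in order to measure its contribution, so it saves no compute itself and only shows how much depth could be skipped. A trained router would have to predict that decision from the block input alone.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def make_scaling_block(scale):
    """Toy stand-in for an ATT or FFN block: f(x) = scale * x (hypothetical)."""
    return lambda x: [scale * v for v in x]


def oracle_forward(x, blocks, threshold):
    """Apply residual blocks to one token, skipping low-contribution blocks.

    Returns the final activation and the dynamic depth, i.e. the number
    of blocks that were actually applied to this token.
    """
    depth = 0
    for block in blocks:
        fx = block(x)  # the oracle "peeks" at the block output
        combined = [a + b for a, b in zip(x, fx)]
        denom = l2_norm(combined)
        ratio = l2_norm(fx) / denom if denom > 0.0 else 0.0
        if ratio >= threshold:
            x = combined  # keep the block: x <- x + f(x)
            depth += 1
        # otherwise skip the block: x passes through unchanged
    return x, depth


# Four toy blocks; for f(x) = s * x the ratio is s / (1 + s).
blocks = [make_scaling_block(s) for s in (0.5, 0.01, 0.2, 0.02)]
output, dynamic_depth = oracle_forward([1.0, 2.0], blocks, threshold=0.05)
print(output, dynamic_depth)
```

In this toy run only the first and third blocks clear the assumed threshold, so the token has a dynamic depth of 2 in a network of 4 layers. This illustrates how the layer count can be decoupled from the depth each token actually traverses.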