M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference

Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova

TL;DR

The paper tackles the inefficiency of applying static residual transformations to every token in autoregressive transformers by introducing Mixture of Multi-rate Residuals (M2R2), which dynamically modulates residual velocity so that token representations align with their final state earlier. M2R2 pairs the slow residual stream of the base model with a parallel, rate-accelerated residual stream, and uses accelerator adapters, shared KV caches, and Accelerated Residual Latent Attention (ARLA) to improve early-exit decisions, speculative decoding, and ahead-of-time (AoT) expert loading in Mixture-of-Experts (MoE) models, while applying FLOPs-aware optimizations. It reports gains across reasoning benchmarks and MoE setups, including up to 2.8x speedups in self-speculative decoding and up to 2.9x with AoT expert loading, without costly pre-training. The approach offers a practical route to more efficient on-device or resource-constrained decoding for both dense and sparse Transformer models, with applicability to dynamic compute, speculative decoding, and MoE inference.

Abstract

Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in auto-regressive generation leads to a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depth, address this by modulating the residual transformation based on token-level complexity. Nevertheless, these approaches predominantly consider the distance traversed by tokens through the model layers, neglecting the underlying velocity of residual evolution. We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment, enhancing inference efficiency. Evaluations on reasoning-oriented tasks such as Koala, Self-Instruct, WizardLM, and MT-Bench show that M2R2 surpasses state-of-the-art distance-based strategies, balancing generation quality and speedup. In a self-speculative decoding setup, M2R2 achieves up to 2.8x speedups on MT-Bench, outperforming methods such as 2-model speculative decoding, Medusa, LookAhead Decoding, and DEED. In Mixture-of-Experts (MoE) architectures, integrating early residual alignment with ahead-of-time expert loading into high-bandwidth memory (HBM) accelerates decoding, reduces expert-switching bottlenecks, and achieves a 2.9x speedup, making it highly effective in resource-constrained environments.
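
The distinction between distance and velocity can be illustrated with a toy sketch. It is not the paper's method: there are no accelerator adapters or attention layers here, and the names `run_stream`, `first_aligned_layer`, the linear update rule, and the 0.9 cosine threshold are all assumptions made for illustration. Each "layer" simply adds a fixed residual step toward a known final state, and a `rate` parameter scales that step, so a faster stream reaches a given similarity to the final state at an earlier layer.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def run_stream(h0, target, num_layers, rate):
    """Toy residual stream (hypothetical update rule).

    Every layer adds a residual step of size rate / num_layers along the
    direction (target - h0), capped so the stream never overshoots target.
    rate=1.0 plays the role of the slow stream; rate>1.0 is a faster one.
    """
    h = list(h0)
    progress = 0.0
    states = []
    for _ in range(num_layers):
        step = min(rate / num_layers, 1.0 - progress)
        h = [x + step * (t - s) for x, s, t in zip(h, h0, target)]
        progress += step
        states.append(h)
    return states


def first_aligned_layer(states, final_state, threshold):
    """Return the 1-indexed layer whose state first reaches the threshold."""
    for layer, state in enumerate(states, start=1):
        if cosine(state, final_state) >= threshold:
            return layer
    return None


h0, final_state = [1.0, 0.0], [0.0, 1.0]
slow = run_stream(h0, final_state, num_layers=8, rate=1.0)
fast = run_stream(h0, final_state, num_layers=8, rate=2.0)

slow_exit = first_aligned_layer(slow, final_state, threshold=0.9)
fast_exit = first_aligned_layer(fast, final_state, threshold=0.9)
print(slow_exit, fast_exit)
```

In this toy setting both streams pass through the same number of layers (the same "distance"), but the faster stream crosses the alignment threshold in half as many layers, which is the intuition behind modulating velocity rather than only choosing where to exit.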

Paper Structure

This paper contains 21 sections, 8 equations, 12 figures.

Figures (12)

  • Figure 1: Traditional early exiting approaches approximate the final residual state with a context-independent mapping, $\mathcal{T}$, applied to an intermediate hidden state, resulting in discontinuities in the transformations and lower similarity with the final residual state. In contrast, M2R2 progressively enhances residual transformation velocity at each layer, enabling more robust and uniform early alignment.
  • Figure 2: (a) As residual streams propagate through the model, the directional shifts in the residuals become progressively smaller. (b) A dedicated model with $k$ layers achieves a faster rate of change in residual streams and higher alignment than the base model using an early exit at layer $k$.
  • Figure 3: Multi-rate Residuals Framework: The slow residual stream of the base model is accompanied by a faster stream that operates at $2\times$ to $(J+1)\times$ the rate of the slow stream, undergoing transformations via accelerator adapters (detailed in the M2R2 method section), where $J$ denotes the number of early exit intervals. Colors within the slow and fast residual streams indicate similarity, with matching colors representing the most closely aligned residual states. At the beginning of the forward pass and at each exit point, the accelerated residual state is initialized from the corresponding slow residual state to avoid gradient conflict during training (see the section on gradient conflict). Early exiting decisions are informed by the Accelerated Residual Latent Attention (ARLA) mechanism (described in the ARLA method section), which evaluates residual dynamics across consecutive exit gates.
  • Figure 4: Effectiveness of ARLA in capturing residual dynamics for early exiting decisions.
  • Figure 5: Ahead-of-Time Expert Loading: The M2R2 accelerated residual stream predicts the experts required for future layers, reducing reliance on on-demand lazy loading. Speculative pre-loading is overlapped with the computation of multi-head attention (MHA) and MLP transformations. Only incorrectly speculated experts are loaded lazily, resulting in faster inference steps and improved computational efficiency. Here, H indicates the low-bandwidth memory (LBM) host, while D indicates the high-bandwidth memory (HBM) device.
  • ...and 7 more figures
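
The ahead-of-time expert loading described in the Figure 5 caption can be sketched as a simple accounting exercise. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `count_lazy_loads`, the per-layer expert lists, and the example predictions are all hypothetical, and the sketch ignores expert caching across steps and the timing overlap with MHA and MLP computation. It only counts how many experts would still need an on-demand (lazy) load when some were speculatively pre-loaded.

```python
def count_lazy_loads(actual_experts, predicted_experts=None):
    """Count experts that must be loaded on demand across all layers.

    actual_experts:    per-layer lists of expert ids the router really selects.
    predicted_experts: per-layer lists of expert ids pre-loaded speculatively
                       (hypothetical output of an accelerated stream), or None
                       to model purely lazy loading.
    """
    lazy = 0
    for layer, needed in enumerate(actual_experts):
        if predicted_experts is None:
            preloaded = set()
        else:
            preloaded = set(predicted_experts[layer])
        # Only experts that were needed but not pre-loaded incur a lazy load.
        lazy += len(set(needed) - preloaded)
    return lazy


# Hypothetical routing for three MoE layers with two active experts each.
actual = [[0, 3], [1, 2], [4, 5]]
# Hypothetical speculation: one wrong guess in the second layer (7, not 2).
predicted = [[0, 3], [1, 7], [4, 5]]

lazy_without_speculation = count_lazy_loads(actual)
lazy_with_speculation = count_lazy_loads(actual, predicted)
print(lazy_without_speculation, lazy_with_speculation)
```

In this example, purely lazy loading fetches all six experts on demand, while speculation leaves a single lazy load for the one mispredicted expert. The real benefit depends on prediction accuracy and on how much of the pre-loading can be hidden behind computation, neither of which this sketch models.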