M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova
TL;DR
The paper addresses the inefficiency of applying static residual transformations to every token in autoregressive transformers. It introduces Mixture of Multi-rate Residuals (M2R2), which dynamically modulates residual velocity so that token representations align earlier. M2R2 pairs a slow residual stream with a parallel, rate-accelerated residual stream, and combines accelerator adapters, shared KV caches, ARLA, and FLOPs-aware optimizations to improve early-exit decisions, speculative decoding, and ahead-of-time (AoT) expert loading in Mixture-of-Experts (MoE) models. Reported gains across reasoning benchmarks and MoE setups include up to 2.8x speedup in self-speculative decoding and up to 2.9x with AoT expert loading, without costly pre-training. The approach offers a practical route to more efficient on-device or resource-constrained decoding for both dense and sparse Transformer models.
Abstract
Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in autoregressive generation leads to a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depths, address this by modulating the residual transformation based on token-level complexity. Nevertheless, these approaches predominantly consider the distance traversed by tokens through the model layers, neglecting the underlying velocity of residual evolution. We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment, enhancing inference efficiency. Evaluations on reasoning-oriented tasks such as Koala, Self-Instruct, WizardLM, and MT-Bench show that M2R2 surpasses state-of-the-art distance-based strategies, balancing generation quality and speedup. In a self-speculative decoding setup, M2R2 achieves up to 2.8x speedups on MT-Bench, outperforming methods such as 2-model speculative decoding, Medusa, LookAhead Decoding, and DEED. In Mixture-of-Experts (MoE) architectures, integrating early residual alignment with ahead-of-time expert loading into high-bandwidth memory (HBM) accelerates decoding, reduces expert-switching bottlenecks, and achieves a 2.9x speedup, making it highly effective in resource-constrained environments.
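The distinction between the distance a token travels through the layers and the velocity of its residual evolution can be illustrated with a toy sketch. This is not the paper's formulation: the scalar hidden state, the update rule, the `rate` parameter, the alignment threshold, and the function name `layers_until_aligned` are all assumptions chosen only to show why a faster-evolving residual stream could permit an earlier exit at equal depth budget.

```python
def layers_until_aligned(rate, threshold=0.9, num_layers=32):
    """Return the first layer at which a toy residual stream counts as aligned.

    Toy assumptions (hypothetical, not taken from the paper):
      * the hidden state is a scalar h starting at 0,
      * the final-layer representation is target = 1,
      * each layer applies the residual update h <- h + rate * (target - h),
      * "aligned" means h has reached `threshold` of the target.
    """
    target = 1.0
    h = 0.0
    for layer in range(1, num_layers + 1):
        # Residual update: the new state is the old state plus a delta.
        h = h + rate * (target - h)
        # Alignment proxy: how close the state is to the final representation.
        if h / target >= threshold:
            return layer
    # Threshold never reached: the token uses the full depth.
    return num_layers


# Same threshold, different residual "velocity".
slow_exit = layers_until_aligned(rate=0.1)  # slow residual stream
fast_exit = layers_until_aligned(rate=0.3)  # rate-accelerated stream
print(f"slow stream aligned at layer {slow_exit}, fast stream at layer {fast_exit}")
```

In this toy setting the accelerated stream crosses the alignment threshold at layer 7, while the slow stream needs 22 layers. A distance-based method can only choose where along the slow trajectory to stop, whereas changing the rate moves the point at which stopping becomes safe.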
