Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen

TL;DR

The Gated Memory Unit (GMU) enables cross-layer memory sharing in a decoder-hybrid-decoder architecture, reducing the memory I/O of a cross-decoder layer during decoding from $O(d_{kv}N)$ for cross-attention to $O(d_h)$ while preserving linear prefill complexity. SambaY pairs Samba as the self-decoder with GMUs in the cross-decoder to boost decoding efficiency and long-context retrieval without explicit positional encoding. Scaling experiments and large-scale pre-training show that SambaY reaches a lower irreducible loss than a YOCO baseline as compute scales and supports strong long-generation reasoning, including a 3.8B model that delivers up to 10x higher decoding throughput under vLLM. The work demonstrates practical gains for efficient long-context LLM reasoning and points to future directions such as sparse attention and sharing other kinds of memory.
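To make the asymptotic claim concrete, the following minimal sketch counts the cached elements that one cross-decoder layer would read to decode a single token. The factor of 2 (keys plus values), the function names, and the example sizes are illustrative assumptions rather than figures from the paper, and weight reads are ignored.

```python
def cross_attention_reads(d_kv: int, n_ctx: int) -> int:
    """Cached elements read per decoded token: keys plus values for every
    context position, i.e. O(d_kv * N). The factor 2 is an assumption."""
    return 2 * d_kv * n_ctx


def gmu_reads(d_h: int) -> int:
    """Cached elements read per decoded token by a GMU: one shared memory
    vector of width d_h, independent of the context length, i.e. O(d_h)."""
    return d_h


# Hypothetical sizes chosen only for illustration.
D_KV, D_H, N_CTX = 1024, 2048, 32_000

attn = cross_attention_reads(D_KV, N_CTX)  # 65_536_000 elements
gmu = gmu_reads(D_H)                       # 2_048 elements
print(f"cross-attention: {attn:,} elements, GMU: {gmu:,} elements")
```

Under these assumed sizes, the cross-attention read grows linearly with the context length, while the GMU read stays fixed, which is the intuition behind the $O(d_{kv}N)$ versus $O(d_h)$ comparison.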

Abstract

Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
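The abstract does not spell out the GMU's equations, so the following is only a sketch of one plausible form: the current layer's input produces an element-wise gate that modulates a memory readout shared from an earlier layer, followed by an output projection. The SiLU gate, the weight names `w_gate` and `w_out`, and all shapes are assumptions made for illustration, not the authors' confirmed implementation.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def gated_memory_unit(x, m, w_gate, w_out):
    """Hypothetical GMU: gate a shared memory readout with the current input.

    x      : (T, d_model) hidden states entering the cross-decoder layer
    m      : (T, d_h)     memory readout shared from the self-decoder
    w_gate : (d_model, d_h) gate projection (assumed name)
    w_out  : (d_h, d_model) output projection (assumed name)
    """
    gate = silu(x @ w_gate)    # (T, d_h) element-wise gate from the input
    return (m * gate) @ w_out  # (T, d_model) gated memory, projected back


rng = np.random.default_rng(0)
T, d_model, d_h = 4, 8, 16     # toy sizes for illustration only
x = rng.normal(size=(T, d_model))
m = rng.normal(size=(T, d_h))
w_gate = rng.normal(size=(d_model, d_h))
w_out = rng.normal(size=(d_h, d_model))

y = gated_memory_unit(x, m, w_gate, w_out)
print(y.shape)
```

In this form, each token needs only its own `d_h`-wide memory vector rather than the whole KV cache, so the per-token cost does not depend on the sequence length.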

Paper Structure

This paper contains 41 sections, 35 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Our decoder-hybrid-decoder architecture, taking Samba [ren2025samba] as the self-decoder. Gated Memory Units (GMUs) are interleaved with the cross-attention layers in the cross-decoder to reduce the decoding complexity. As in YOCO [sun2024cache], the full-attention layer only needs to compute the KV cache during prefilling with the self-decoder, leading to linear computational complexity for the prefill stage.
  • Figure 2: Validation loss vs. FLOPs (left) or training tokens (right) on the SlimPajama dataset. Besides the architecture comparisons, we also compare our $\mu$P++-based scaling with the Standard Parametrization (SP).
  • Figure 3: Accuracy (with error bars) vs. sliding window size on Phonebook with 32K evaluation length.
  • Figure 4: Throughput and latency of text generation with various architectures under the vLLM inference framework (using one A100-80GB GPU and no Tensor Parallelism). A normal distribution with 30% variance was applied to the prompt and generation lengths, with averages of 32000/2000 and 500/32000 respectively, following the setting in [holmes2024deepspeed0fastgen0].
  • Figure 5: Major architectural variants explored in this section. For GDNY, we use Gated DeltaNet [yang2025gated] with normalization after the output gate (GDN-A) for the self-decoder, and apply normalized GMU (nGMU) in the cross-decoder.
  • ...and 4 more figures