Attention Residuals

Kimi Team; Guangyu Chen; Yu Zhang; Jianlin Su; Weixin Xu; Siyuan Pan; Yaoyu Wang; Yucheng Wang; Guanduo Chen; Bohong Yin; Yutian Chen; Junjie Yan; Ming Wei; Y. Zhang; Fanqing Meng; Chao Hong; Xiaotong Xie; Shaowei Liu; Enzhe Lu; Yunpeng Tai; Yanru Chen; Xin Men; Haiqing Guo; Y. Charles; Haoyu Lu; Lin Sui; Jinguo Zhu; Zaida Zhou; Weiran He; Weixiao Huang; Xinran Xu; Yuzhi Wang; Guokun Lai; Yulun Du; Yuxin Wu; Zhilin Yang; Xinyu Zhou

Abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
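
The contrast drawn in the abstract, fixed unit-weight accumulation versus softmax attention over preceding layer outputs, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of raw layer outputs as keys and values, and the single per-layer `query` vector (standing in for the learned pseudo-query $\bm{w}_l$ mentioned in the Figure 2 caption) are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual(outputs):
    """Standard residual stream: sum every preceding output with weight 1."""
    return [sum(column) for column in zip(*outputs)]

def attn_residual(outputs, query):
    """Hypothetical AttnRes aggregation over depth.

    outputs: list of preceding layer outputs, each a list of d floats.
    query:   list of d floats (assumed per-layer learned vector).
    The weights depend on the outputs themselves, so they vary with the input.
    """
    logits = [sum(q * x for q, x in zip(query, out)) for out in outputs]
    weights = softmax(logits)
    mixed = [
        sum(w * out[j] for w, out in zip(weights, outputs))
        for j in range(len(query))
    ]
    return mixed, weights

outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summed = standard_residual(outputs)           # grows with the number of layers
mixed, weights = attn_residual(outputs, [0.0, 0.0])
```

Because the softmax weights sum to one, the aggregated state in this sketch is a convex combination of earlier outputs, whereas the unit-weight sum grows as more layers are added. That is the magnitude growth the abstract attributes to uniform accumulation.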

Paper Structure (51 sections, 19 equations, 10 figures, 5 tables, 1 algorithm)

Introduction
Motivation
Notation.
Training Deep Networks via Residuals
Residual Learning.
Generalizing Residuals.
Limitations.
Attention Residuals: A Unified View of Time and Depth
The Duality of Time and Depth.
Full Attention Residuals
Overhead.
Blockwise optimization.
Block Attention Residuals
Intra-Block Accumulation.
Inter-Block Attention.
...and 36 more sections

Figures (10)

Figure 1: Overview of Attention Residuals. (a) Standard Residuals: standard residual connections with uniform additive accumulation. (b) Full AttnRes: each layer selectively aggregates all previous layer outputs via learned attention weights. (c) Block AttnRes: layers are grouped into blocks, reducing memory from $O(Ld)$ to $O(Nd)$.
Figure 2: PyTorch-style pseudo code for Block Attention Residuals. block_attn_res computes $\operatorname{softmax}$ attention over block representations using a learned pseudo-query $\bm{w}_l$; forward is a single-layer pass that maintains partial_block ($\bm{b}_n^i$, intra-block residual) and blocks ($[\bm{b}_0, \ldots, \bm{b}_{n-1}]$, inter-block history).
Figure 3: Cache-based pipeline communication example with 4 physical ranks and 2 virtual stages per rank, where hatched boxes denote end of AttnRes blocks. Numbers indicate micro-batch indices. Each rank caches previously received blocks; stage transitions only transmit incremental blocks ($+[\bm{b}_1, \bm{b}_2]$) instead of the full history.
Figure 4: Scaling law curves for Attention Residuals. Both Full and Block AttnRes consistently outperform the baseline across all scales. Block AttnRes closely tracks Full AttnRes, recovering most of the gain at the largest scale.
Figure 4: Ablation on key components of AttnRes (16-layer model).
...and 5 more figures
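
The Figure 2 caption describes two routines, block_attn_res and forward, without reproducing them here. The following is a rough pure-Python sketch of how such a pair might fit together, not the paper's pseudo code. The argument order, the choice to seed `blocks` with the embedding as $\bm{b}_0$, the inclusion of the in-progress partial block among the attention candidates, and the `block_size` boundary rule are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def block_attn_res(blocks, partial_block, query):
    """Softmax attention over block representations (assumed form).

    blocks:        completed block representations [b_0, ..., b_{n-1}]
    partial_block: running intra-block sum, or None at a block boundary
    query:         stand-in for the learned pseudo-query w_l
    """
    candidates = blocks if partial_block is None else blocks + [partial_block]
    logits = [sum(q * x for q, x in zip(query, c)) for c in candidates]
    weights = softmax(logits)
    return [
        sum(w * c[j] for w, c in zip(weights, candidates))
        for j in range(len(query))
    ]

def forward(layer_fn, query, blocks, partial_block, layer_idx, block_size):
    """Single-layer pass that updates partial_block and blocks."""
    h = block_attn_res(blocks, partial_block, query)   # inter-block attention
    out = layer_fn(h)
    # Intra-block accumulation: plain sum of outputs inside the current block.
    if partial_block is None:
        partial_block = out
    else:
        partial_block = [a + b for a, b in zip(partial_block, out)]
    # Assumed boundary rule: close the block every block_size layers.
    if (layer_idx + 1) % block_size == 0:
        blocks = blocks + [partial_block]
        partial_block = None
    return blocks, partial_block

# Toy run: 4 layers, 2 layers per block, a hypothetical layer that halves its input.
blocks, partial = [[1.0, 2.0]], None
for i in range(4):
    blocks, partial = forward(
        lambda h: [0.5 * x for x in h], [0.0, 0.0], blocks, partial, i, block_size=2
    )
```

In this toy run the history holds one vector per block plus the initial embedding rather than one per layer, which is the kind of $O(Ld)$ to $O(Nd)$ saving the Figure 1 caption refers to.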
