Value Residual Learning

Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Fares Obeid, Zhenzhong Lan

TL;DR

This work challenges the sufficiency of traditional hidden-state residuals in deep Transformer models by introducing ResFormer, which adds value residual connections from the first layer to later layers to better preserve token-level information. It further proposes SVFormer, a KV-cache–efficient variant in which all layers share the first layer's value embedding, reducing memory costs during inference. Empirical results show that ResFormer matches the Transformer's validation loss with 16.11% fewer parameters and 20.3% less training data, while SVFormer nearly halves the KV cache at a modest accuracy cost and remains compatible with other KV-efficient methods. These findings offer a practical path to more parameter- and data-efficient deep Transformers, especially for long-sequence modeling.

Abstract

While Transformer models have achieved remarkable success in various domains, the effectiveness of information propagation through deep networks remains a critical challenge. Standard hidden state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by incorporating value residual connections in addition to hidden state residuals. It also introduces a variant, SVFormer, in which all layers share the first layer's value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 16.11% fewer model parameters and 20.3% less training data than the Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty and can be integrated with other KV-efficient methods to reduce the KV cache further; its performance is influenced by sequence length and cumulative learning rate.
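
The "nearly half" KV cache figure follows from simple counting. The snippet below is a minimal sketch, assuming that every layer's key and value tensors have the same size and that a shared-value model caches one key tensor per layer plus the single first-layer value tensor. The function names are hypothetical and do not come from the paper.

```python
def kv_cache_tensors(num_layers: int, share_first_value: bool = False) -> int:
    """Count cached tensors per token, in units of one layer's key (or value) tensor."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    keys = num_layers  # every layer still caches its own keys
    # Assumption: with a shared first-layer value, only that one value tensor is cached.
    values = 1 if share_first_value else num_layers
    return keys + values


def shared_value_cache_ratio(num_layers: int) -> float:
    """Cache size with a shared first-layer value, relative to a standard Transformer."""
    shared = kv_cache_tensors(num_layers, share_first_value=True)
    standard = kv_cache_tensors(num_layers)
    return shared / standard


if __name__ == "__main__":
    for depth in (8, 24, 48):
        print(depth, round(shared_value_cache_ratio(depth), 3))
```

Under these assumptions the ratio is (L + 1) / (2L) for a model with L layers, which approaches one half as depth grows; for 24 layers it is 25/48, or about 0.52.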

Paper Structure

This paper contains 36 sections, 4 equations, 18 figures, and 14 tables.

Figures (18)

  • Figure 1: Simplified illustration of the vanilla Transformer, NeuTRENO, DenseFormer, ResFormer, and SVFormer, with only three-layer structures and no operations other than attention. $\mathbf{A}^{i}$, $\mathbf{V}^{i}$, and $\mathbf{H}^{i}$ denote the attention matrix, value vectors, and attention outputs at the $i$-th layer, respectively. $\oplus$, $\ominus$, and $\otimes$ represent standard matrix addition, subtraction, and multiplication, respectively.
  • Figure 2: (Left) Validation loss as model size scales from 82M to 468M parameters on 20B tokens. (Middle) Validation loss for the 468M-parameter model evaluated every 2B tokens. ResFormer achieves an approximately 16.1% to 20.3% reduction in model parameters and training data. (Right) Validation loss for the 1.6B-parameter model evaluated every 10B tokens.
  • Figure 3: The impact of varying $\bm\lambda$ values on 82M 8-layer Constant-ResFormer and NeuTRENO.
  • Figure 4: (Left) Average gradient norms of model outputs with respect to parameter matrices across different layers in Transformer and ResFormer. (Right) Comparison of Transformer and ResFormer performance across various learning rates during training.
  • Figure 5: (Left) Impact of value skip connections sourced from different layers on model performance, where all connections are identity connections and $\bm\lambda=1$ in Dense-ResFormer. (Right) Average validation loss of various Sparse-ResFormer configurations, which retain only one or several skip connections from $\mathbf{V}_{1}$.
  • ...and 13 more figures
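
The simplified setting of Figure 1 (a few layers, attention only) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the mixing weights `lam_first` and `lam_current`, the helper names, the single attention head, the absence of a causal mask, and the random weights are all assumptions made for the example. It contrasts three ways of forming the value at layer i: a fresh projection (vanilla Transformer), a weighted sum of the fresh projection and the first-layer value (a value residual in the spirit of ResFormer), and direct reuse of the first-layer value (as in SVFormer).

```python
import math
import random


def matmul(a, b):
    """Matrix product of two nested lists (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mix(a, b, wa, wb):
    """Elementwise weighted sum wa * a + wb * b."""
    return [[wa * x + wb * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def attention_matrix(h, w_q, w_k):
    """A^i: row-wise softmax of scaled dot-product scores (no causal mask, for brevity)."""
    q, k = matmul(h, w_q), matmul(h, w_k)
    scale = 1.0 / math.sqrt(len(q[0]))
    rows = []
    for q_row in q:
        scores = [scale * sum(x * y for x, y in zip(q_row, k_row)) for k_row in k]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def forward(x, layers, mode="transformer", lam_first=0.5, lam_current=0.5):
    """Run attention-only layers; `mode` selects how the value V^i is formed."""
    if mode not in ("transformer", "resformer", "svformer"):
        raise ValueError(f"unknown mode: {mode}")
    h, v_first = x, None
    for w_q, w_k, w_v in layers:
        a = attention_matrix(h, w_q, w_k)
        if v_first is None:
            v = v_first = matmul(h, w_v)  # first layer: ordinary value projection
        elif mode == "svformer":
            v = v_first  # reuse V^1; this layer's w_v is never used
        else:
            v = matmul(h, w_v)
            if mode == "resformer":
                # Assumed mixing rule: weighted sum of V^1 and the current value.
                v = mix(v_first, v, lam_first, lam_current)
        h = mix(h, matmul(a, v), 1.0, 1.0)  # hidden-state residual: h + A^i V^i
    return h


def random_layers(num_layers, dim, seed=0):
    """Hypothetical helper: random (w_q, w_k, w_v) triples of shape dim x dim."""
    rng = random.Random(seed)

    def weight():
        return [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]

    return [(weight(), weight(), weight()) for _ in range(num_layers)]


if __name__ == "__main__":
    layers = random_layers(num_layers=3, dim=4)
    tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    for mode in ("transformer", "resformer", "svformer"):
        out = forward(tokens, layers, mode=mode)
        print(mode, [round(value, 3) for value in out[0]])
```

In this sketch, setting `lam_first=0.0` and `lam_current=1.0` makes the `"resformer"` path identical to the vanilla one, and with a single layer all three modes coincide, since the first layer always uses its own value projection.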