Does Self-Attention Need Separate Weights in Transformers?

Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Ozmen Garibay, Niloofar Yousefi

TL;DR

The paper tackles the high parameter and computational cost of self-attention by proposing a shared-weight mechanism: a single matrix $W_s$ produces a common representation, and diagonal scalings derive $Q$, $K$, and $V$ from it. This full QKV sharing cuts attention-block parameters by $66.53\%$ and total BERT parameters by $12.94\%$, while maintaining competitive results on GLUE and QA benchmarks and improving robustness to noise and domain shifts. The shared-weight BERT remains effective across GLUE tasks (e.g., MRPC, CoLA, STS-B) and trains faster, though QA performance declines slightly relative to standard self-attention. The approach offers a practical path to more efficient transformer models for resource-constrained deployment and noisy data, with extensions to decoders and large-scale models left as future directions.
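To make the mechanism concrete, here is a minimal single-head sketch of attention with one shared projection and three diagonal scalings. It is an illustration under stated assumptions, not the authors' implementation: the function name `shared_weight_attention`, the vectors `d_q`, `d_k`, `d_v` (standing in for the diagonals), the absence of biases, masking, and multiple heads, and the random initialisation are all hypothetical choices.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_weight_attention(X, W_s, d_q, d_k, d_v):
    """Single-head attention with one shared projection (assumed form).

    X:   (n, d_in) token representations
    W_s: (d_in, d) shared weight matrix
    d_q, d_k, d_v: (d,) vectors holding the diagonal scalings
    """
    S = X @ W_s      # common representation, computed once
    Q = S * d_q      # same as S @ np.diag(d_q), without building the matrix
    K = S * d_k
    V = S * d_v
    scores = Q @ K.T / np.sqrt(W_s.shape[1])
    return softmax(scores, axis=-1) @ V


# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_s = rng.normal(size=(d, d)) / np.sqrt(d)
d_q, d_k, d_v = (rng.normal(size=d) for _ in range(3))
out = shared_weight_attention(X, W_s, d_q, d_k, d_v)
```

Multiplying by a vector element-wise is equivalent to right-multiplying by a diagonal matrix, so each of $Q$, $K$, and $V$ costs only $d$ extra parameters on top of the shared $W_s$ in this sketch.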

Abstract

The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent directionality. This work introduces a shared weight self-attention-based BERT model that only learns one weight matrix for (Key, Value, and Query) representations instead of three individual matrices for each of them. Our shared weight attention reduces the training parameter size by more than half and training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy on small tasks of GLUE over the BERT baseline and in particular a generalization power on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. In the GLUE dataset, the shared weight self-attention-based BERT model demonstrates accuracy improvements of 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively. The model and source code are available at Anonymous.
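The attention-block reduction can be sanity-checked with simple arithmetic. The sketch below assumes a hidden size of 768 (as in BERT-base), counts only the Q/K/V projections, and ignores biases and the output projection; this accounting is an assumption and may differ from the paper's exact bookkeeping. The helper name `qkv_projection_params` is hypothetical.

```python
def qkv_projection_params(d_model: int, shared: bool) -> int:
    """Count Q/K/V projection weights under the assumed accounting."""
    if shared:
        # One d x d shared matrix plus three length-d diagonal scalings.
        return d_model * d_model + 3 * d_model
    # Three separate d x d matrices.
    return 3 * d_model * d_model


d_model = 768  # assumed hidden size (BERT-base)
standard = qkv_projection_params(d_model, shared=False)
shared = qkv_projection_params(d_model, shared=True)
reduction = 1 - shared / standard
print(f"standard={standard}, shared={shared}, reduction={reduction:.2%}")
```

Under these assumptions the reduction comes out at roughly 66.5%, in line with the 66.53% figure quoted in the abstract.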

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Comparison of traditional self-attention (left) and shared weight self-attention (right).
  • Figure 2: Pretraining loss curves for the shared weight self-attention mechanism. The plot shows the loss for both training and validation sets over 200,000 steps.
  • Figure 3: Training time comparison between shared weight and standard self-attention on GLUE tasks. CoLA, MRPC, and QQP are recorded in seconds; the other tasks are presented in minutes.